Freeze avoidance of very large table.
Hi all,
I'd like to propose a read-only table feature to avoid full scans of very
large tables.
A WIP patch is attached.
- Background
Postgres can keep a tuple forever by freezing it, but freezing tuples
requires scanning the whole table.
That negatively affects system performance, especially in very large
database systems.
There is no command that guarantees a whole table has been completely
frozen, so postgres has to keep freezing tuples even if the table has not
been written to at all.
One way to avoid repeatedly scanning a very large table is a DDL command
that ensures all tuples are frozen and marks the table as read-only.
This topic has been discussed before; it was originally proposed by Simon.
- Feature
I have implemented this feature as ALTER TABLE ... SET READ ONLY and
ALTER TABLE ... SET READ WRITE.
The attached WIP patch shows what I have in mind.
The patch does the following:
* Add a new column relreadonly to pg_class.
* Add new syntax: ALTER TABLE SET READ ONLY and ALTER TABLE SET READ WRITE.
* When marking a table read-only, all of its tuples are frozen in one pass
while holding ShareLock (like VACUUM FREEZE), and then
pg_class.relreadonly is set to true.
* When un-marking read-only, pg_class.relreadonly is simply set back to false.
* If the table has a TOAST table, the TOAST table is marked as well at the
same time.
* Writes to and vacuuming of a read-only table are completely rejected or
ignored, e.g., INSERT, UPDATE, DELETE, explicit VACUUM, autovacuum.
There are a few remaining, but not critical, problems:
* Freezing all tuples is quite similar to VACUUM FREEZE, but calling
lazy_vacuum_rel() would be overkill, I think.
* The lock level needs more consideration.
Please give me feedback.
Regards,
-------
Sawada Masahiko
Attachments:
000_read_only_table_v0.patch (text/x-patch, charset US-ASCII)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 002319e..a41be00 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -16,6 +16,7 @@
#include "access/genam.h"
#include "access/heapam.h"
+#include "access/heapam_xlog.h"
#include "access/multixact.h"
#include "access/reloptions.h"
#include "access/relscan.h"
@@ -79,6 +80,7 @@
#include "storage/lmgr.h"
#include "storage/lock.h"
#include "storage/predicate.h"
+#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/acl.h"
#include "utils/builtins.h"
@@ -392,9 +394,13 @@ static void change_owner_fix_column_acls(Oid relationOid,
Oid oldOwnerId, Oid newOwnerId);
static void change_owner_recurse_to_sequences(Oid relationOid,
Oid newOwnerId, LOCKMODE lockmode);
+static void change_relreadonly(Oid relOid, bool flag);
static ObjectAddress ATExecClusterOn(Relation rel, const char *indexName,
LOCKMODE lockmode);
static void ATExecDropCluster(Relation rel, LOCKMODE lockmode);
+static void ATPrepSetReadOnly(Relation rel, LOCKMODE lockmode);
+static void ATExecSetReadOnly(Relation rel);
+static void ATExecSetReadWrite(Relation rel);
static bool ATPrepChangePersistence(Relation rel, bool toLogged);
static void ATPrepSetTableSpace(AlteredTableInfo *tab, Relation rel,
char *tablespacename, LOCKMODE lockmode);
@@ -3006,6 +3012,9 @@ AlterTableGetLockLevel(List *cmds)
cmd_lockmode = ShareUpdateExclusiveLock;
break;
+ case AT_SetReadOnly:
+ case AT_SetReadWrite:
+ cmd_lockmode = ShareLock;
+ break;
+
case AT_SetLogged:
case AT_SetUnLogged:
cmd_lockmode = AccessExclusiveLock;
@@ -3254,6 +3263,16 @@ ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd,
}
pass = AT_PASS_MISC;
break;
+ case AT_SetReadOnly: /* SET READ ONLY */
+ ATSimplePermissions(rel, ATT_TABLE);
+ /* Performs freezing all tuples */
+ ATPrepSetReadOnly(rel, lockmode);
+ pass = AT_PASS_MISC;
+ break;
+ case AT_SetReadWrite: /* SET READ WRITE */
+ ATSimplePermissions(rel, ATT_TABLE);
+ pass = AT_PASS_MISC;
+ break;
case AT_AddOids: /* SET WITH OIDS */
ATSimplePermissions(rel, ATT_TABLE | ATT_FOREIGN_TABLE);
if (!rel->rd_rel->relhasoids || recursing)
@@ -3548,6 +3567,14 @@ ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
case AT_SetLogged: /* SET LOGGED */
case AT_SetUnLogged: /* SET UNLOGGED */
break;
+ case AT_SetReadOnly: /* SET READ ONLY */
+ /* Update system catalog to change flag to true */
+ ATExecSetReadOnly(rel);
+ break;
+ case AT_SetReadWrite: /* SET READ WRITE */
+ /* Update system catalog to change flag to false */
+ ATExecSetReadWrite(rel);
+ break;
case AT_AddOids: /* SET WITH OIDS */
/* Use the ADD COLUMN code, unless prep decided to do nothing */
if (cmd->def != NULL)
@@ -9168,6 +9195,192 @@ ATExecDropCluster(Relation rel, LOCKMODE lockmode)
}
/*
+ * ALTER TABLE SET READ ONLY
+ *
+ * We first freeze all live tuples, and then set the relreadonly column
+ * of the pg_class system catalog to true. Freezing always scans all
+ * pages and freezes tuples one by one, like VACUUM FREEZE.
+ */
+static void
+ATPrepSetReadOnly(Relation rel, LOCKMODE lockmode)
+{
+ BlockNumber nblocks,
+ blkno;
+ OffsetNumber offnum;
+ HeapTupleData tuple;
+ xl_heap_freeze_tuple *frozen;
+ int nfrozen;
+ int i;
+ TransactionId oldestxmin, freezelimit;
+ MultiXactId mxactcutoff;
+
+ PreventTransactionChain(true, "ALTER TABLE SET READ ONLY");
+
+ relation_open(RelationGetRelid(rel), lockmode);
+ nblocks = RelationGetNumberOfBlocks(rel);
+ oldestxmin = freezelimit = GetOldestXmin(rel, true);
+ mxactcutoff = ReadNextMultiXactId();
+ frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
+
+ for (blkno = 0; blkno < nblocks; blkno++)
+ {
+ Buffer buf;
+ Page page;
+ OffsetNumber maxoff;
+
+ nfrozen = 0;
+
+ buf = ReadBuffer(rel, blkno);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid;
+
+ itemid = PageGetItemId(page, offnum);
+
+ /* Skip unused items */
+ if (!ItemIdIsUsed(itemid) ||
+ ItemIdIsRedirected(itemid) ||
+ ItemIdIsDead(itemid))
+ continue;
+
+ Assert(ItemIdIsNormal(itemid));
+
+ ItemPointerSet(&(tuple.t_self), blkno, offnum);
+ tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+ tuple.t_len = ItemIdGetLength(itemid);
+ tuple.t_tableOid = RelationGetRelid(rel);
+
+ /*
+ * Check the state of each tuple using the same mechanism as VACUUM.
+ * We are interested only in live tuples, so skip dead ones.
+ */
+ switch(HeapTupleSatisfiesVacuum(&tuple, oldestxmin, buf))
+ {
+ case HEAPTUPLE_DEAD:
+ break;
+ case HEAPTUPLE_LIVE:
+ case HEAPTUPLE_RECENTLY_DEAD:
+ case HEAPTUPLE_INSERT_IN_PROGRESS:
+ case HEAPTUPLE_DELETE_IN_PROGRESS:
+ if (heap_prepare_freeze_tuple(tuple.t_data, freezelimit,
+ mxactcutoff, &frozen[nfrozen]))
+ frozen[nfrozen++].offset = offnum;
+ break;
+ default:
+ elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+ break;
+ }
+ }
+
+ /*
+ * If any tuples need to be frozen, do it now. If the relation needs
+ * WAL, write a WAL record for the change as well.
+ */
+ if (nfrozen > 0)
+ {
+ START_CRIT_SECTION();
+ MarkBufferDirty(buf);
+
+ for (i = 0; i < nfrozen; i++)
+ {
+ ItemId itemid;
+ HeapTupleHeader htup;
+
+ itemid = PageGetItemId(page, frozen[i].offset);
+ htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+ heap_execute_freeze_tuple(htup, &frozen[i]);
+ }
+
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = log_heap_freeze(rel, buf, freezelimit, frozen, nfrozen);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+ }
+
+ UnlockReleaseBuffer(buf);
+ }
+
+ pfree(frozen);
+ relation_close(rel, lockmode);
+}
+
+/*
+ * This is phase 2 of ALTER TABLE SET READ ONLY.
+ * We just update pg_class.relreadonly to true.
+ */
+static void
+ATExecSetReadOnly(Relation rel)
+{
+ Oid relid,
+ toast_relid;
+
+ relid = RelationGetRelid(rel);
+ Assert(OidIsValid(relid));
+
+ /* Change readonly flag to true */
+ change_relreadonly(relid, true);
+
+ /* If relation has TOAST table then change readonly flag as well */
+ toast_relid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toast_relid))
+ change_relreadonly(toast_relid, true);
+}
+
+static void
+change_relreadonly(Oid relOid, bool flag)
+{
+ Relation relationRelation;
+ HeapTuple tuple;
+ Form_pg_class classtuple;
+
+ relationRelation = heap_open(RelationRelationId, RowExclusiveLock);
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relOid));
+
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache look up filed for relation %u", relOid);
+
+ classtuple = (Form_pg_class) GETSTRUCT(tuple);
+ classtuple->relreadonly = flag;
+
+ simple_heap_update(relationRelation, &tuple->t_self, tuple);
+ CatalogUpdateIndexes(relationRelation, tuple);
+
+ heap_freetuple(tuple);
+ heap_close(relationRelation, RowExclusiveLock);
+}
+
+static void
+ATExecSetReadWrite(Relation rel)
+{
+ Oid relid,
+ toast_relid;
+
+ relid = RelationGetRelid(rel);
+ Assert(OidIsValid(relid));
+
+ /* Change readonly flag to false */
+ change_relreadonly(relid, false);
+
+ /* If relation has TOAST table then change readonly flag as well */
+ toast_relid = rel->rd_rel->reltoastrelid;
+ if (OidIsValid(toast_relid))
+ change_relreadonly(toast_relid, false);
+}
+
+/*
* ALTER TABLE SET TABLESPACE
*/
static void
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index bd57b68..4adddb3 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1297,6 +1297,20 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
}
/*
+ * Check whether it's read-only table
+ */
+ if (RelationIsReadOnly(onerel))
+ {
+ ereport(WARNING,
+ (errmsg("skipping \"%s\" --- cannot vacuum read-only table",
+ RelationGetRelationName(onerel))));
+ relation_close(onerel, lmode);
+ PopActiveSnapshot();
+ CommitTransactionCommand();
+ return false;
+ }
+
+ /*
* Silently ignore tables that are temp tables of other backends ---
* trying to vacuum these will lead to great unhappiness, since their
* contents are probably not up-to-date on disk. (We don't throw a
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index f96fb24..96a2acf 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -185,6 +185,13 @@ ExecInsert(TupleTableSlot *slot,
resultRelationDesc = resultRelInfo->ri_RelationDesc;
/*
+ * Check if we are about to INSERT into a read-only table.
+ */
+ if (RelationIsReadOnly(resultRelationDesc))
+ ereport(ERROR, (errmsg("cannot INSERT into read-only table: \"%s\"",
+ RelationGetRelationName(resultRelationDesc))));
+
+ /*
* If the result relation has OIDs, force the tuple's OID to zero so that
* heap_insert will assign a fresh OID. Usually the OID already will be
* zero at this point, but there are corner cases where the plan tree can
@@ -337,6 +344,14 @@ ExecDelete(ItemPointer tupleid,
resultRelInfo = estate->es_result_relation_info;
resultRelationDesc = resultRelInfo->ri_RelationDesc;
+ /*
+ * Check if we are about to DELETE from a read-only table.
+ */
+ if (RelationIsReadOnly(resultRelationDesc))
+ ereport(ERROR, (errmsg("cannot DELETE from read-only table: \"%s\"",
+ RelationGetRelationName(resultRelationDesc))));
+
+
/* BEFORE ROW DELETE Triggers */
if (resultRelInfo->ri_TrigDesc &&
resultRelInfo->ri_TrigDesc->trig_delete_before_row)
@@ -599,6 +614,13 @@ ExecUpdate(ItemPointer tupleid,
resultRelInfo = estate->es_result_relation_info;
resultRelationDesc = resultRelInfo->ri_RelationDesc;
+ /*
+ * Check if we are about to UPDATE a read-only table.
+ */
+ if (RelationIsReadOnly(resultRelationDesc))
+ ereport(ERROR, (errmsg("cannot UPDATE read-only table: \"%s\"",
+ RelationGetRelationName(resultRelationDesc))));
+
/* BEFORE ROW UPDATE Triggers */
if (resultRelInfo->ri_TrigDesc &&
resultRelInfo->ri_TrigDesc->trig_update_before_row)
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 88ec83c..6d160f2 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -2132,6 +2132,20 @@ alter_table_cmd:
n->subtype = AT_SetUnLogged;
$$ = (Node *)n;
}
+ /* ALTER TABLE <name> SET READ ONLY */
+ | SET READ ONLY
+ {
+ AlterTableCmd *n = makeNode(AlterTableCmd);
+ n->subtype = AT_SetReadOnly;
+ $$ = (Node *)n;
+ }
+ /* ALTER TABLE <name> SET READ WRITE */
+ | SET READ WRITE
+ {
+ AlterTableCmd *n = makeNode(AlterTableCmd);
+ n->subtype = AT_SetReadWrite;
+ $$ = (Node *)n;
+ }
/* ALTER TABLE <name> ENABLE TRIGGER <trig> */
| ENABLE_P TRIGGER name
{
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index c93b412..7898e89 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1985,6 +1985,13 @@ do_autovacuum(void)
classForm->relkind != RELKIND_MATVIEW)
continue;
+ /*
+ * Skip if the table is marked as read-only.
+ */
+ if (classForm->relreadonly)
+ continue;
+
+
relid = HeapTupleGetOid(tuple);
/* Fetch reloptions and the pgstat entry for this table */
@@ -2100,6 +2107,12 @@ do_autovacuum(void)
if (classForm->relpersistence == RELPERSISTENCE_TEMP)
continue;
+ /*
+ * Skip if the table is marked as read-only.
+ */
+ if (classForm->relreadonly)
+ continue;
+
relid = HeapTupleGetOid(tuple);
/*
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 8b4c35c..6742dc7 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -50,6 +50,7 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
+ bool relreadonly; /* T if read only table */
char relpersistence; /* see RELPERSISTENCE_xxx constants below */
char relkind; /* see RELKIND_xxx constants below */
int16 relnatts; /* number of user attributes */
@@ -95,7 +96,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 30
+#define Natts_pg_class 31
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -110,22 +111,23 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_reltoastrelid 12
#define Anum_pg_class_relhasindex 13
#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relispopulated 25
-#define Anum_pg_class_relreplident 26
-#define Anum_pg_class_relfrozenxid 27
-#define Anum_pg_class_relminmxid 28
-#define Anum_pg_class_relacl 29
-#define Anum_pg_class_reloptions 30
+#define Anum_pg_class_relreadonly 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relispopulated 26
+#define Anum_pg_class_relreplident 27
+#define Anum_pg_class_relfrozenxid 28
+#define Anum_pg_class_relminmxid 29
+#define Anum_pg_class_relacl 30
+#define Anum_pg_class_reloptions 31
/* ----------------
* initial contents of pg_class
@@ -140,13 +142,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 27 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f f p r 27 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f f p r 31 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 2893cef..869df79 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1336,6 +1336,8 @@ typedef enum AlterTableType
AT_DropCluster, /* SET WITHOUT CLUSTER */
AT_SetLogged, /* SET LOGGED */
AT_SetUnLogged, /* SET UNLOGGED */
+ AT_SetReadOnly, /* SET READ ONLY */
+ AT_SetReadWrite, /* SET READ WRITE */
AT_AddOids, /* SET WITH OIDS */
AT_AddOidsRecurse, /* internal to commands/tablecmds.c */
AT_DropOids, /* SET WITHOUT OIDS */
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 6bd786d..12620fa 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -435,6 +435,13 @@ typedef struct ViewOptions
((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
/*
+ * RelationIsReadOnly
+ * True if relation is read only.
+ */
+#define RelationIsReadOnly(relation) \
+ ((relation)->rd_rel->relreadonly)
+
+/*
* RelationUsesLocalBuffers
* True if relation's pages are stored in local buffers.
*/
On 4/3/15 12:59 AM, Sawada Masahiko wrote:
> + case HEAPTUPLE_LIVE:
> + case HEAPTUPLE_RECENTLY_DEAD:
> + case HEAPTUPLE_INSERT_IN_PROGRESS:
> + case HEAPTUPLE_DELETE_IN_PROGRESS:
> + if (heap_prepare_freeze_tuple(tuple.t_data, freezelimit,
> + mxactcutoff, &frozen[nfrozen]))
> + frozen[nfrozen++].offset = offnum;
> + break;
This doesn't seem safe enough to me. Can't there be tuples that are
still new enough that they can't be frozen, and are still live? I don't
think it's safe to leave tuples as dead either, even if they're hinted.
The hint may not be written. Also, the patch seems to be completely
ignoring actually freezing the toast relation; I can't see how that's
actually safe.
I'd feel a heck of a lot safer if any time heap_prepare_freeze_tuple
returned false we did a second check on the tuple to ensure it was truly
frozen.
Somewhat related... instead of forcing the freeze to happen
synchronously, can't we set this up so a table is in one of three
states? Read/Write, Read Only, Frozen. AT_SetReadOnly and
AT_SetReadWrite would simply change to the appropriate state, and all
the vacuum infrastructure would continue to process those tables as it
does today. lazy_vacuum_rel would become responsible for tracking if
there were any non-frozen tuples if it was also attempting a freeze. If
it discovered there were none, AND the table was marked as ReadOnly,
then it would change the table state to Frozen and set relfrozenxid =
InvalidTransactionId and relminmxid = InvalidMultiXactId. AT_SetReadWrite
could change relfrozenxid to its own Xid as an optimization. Doing it
that way leaves all the complicated vacuum code in one place, and would
eliminate concerns about race conditions with still running
transactions, etc.
BTW, you also need to put things in place to ensure it's impossible to
unfreeze a tuple in a relation that's marked ReadOnly or Frozen. I'm not
sure what the right way to do that would be.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 4/4/15 5:10 PM, Jim Nasby wrote:
> BTW, you also need to put things in place to ensure it's impossible to
> unfreeze a tuple in a relation that's marked ReadOnly or Frozen. I'm not
> sure what the right way to do that would be.
Answering my own question... I think visibilitymap_clear() would be the
right place. AFAICT this is basically as critical as clearing the VM,
and that function has the Relation, so it can see what mode the relation
is in.
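Concretely, I'm imagining something like the sketch below, using the
RelationIsReadOnly() macro from the posted patch and today's
visibilitymap_clear() signature; the error code is only a placeholder.

void
visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
{
	/*
	 * A read-only (or frozen) relation must never have its all-visible
	 * status cleared; anything that tries to is a bug.
	 */
	if (RelationIsReadOnly(rel))
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("cannot clear visibility map of read-only table \"%s\"",
						RelationGetRelationName(rel))));

	/* ... existing clearing logic stays unchanged ... */
}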
There is another possibility here, too. We can completely divorce a
ReadOnly mode (which I think is useful for other things besides
freezing) from the question of whether we need to force-freeze a
relation if we create a FrozenMap, similar to the visibility map. This
has the added advantage of helping freeze scans on relations that are
not ReadOnly in the case of tables that are insert-mostly or any other
pattern where most pages stay all-frozen.
Prior to the visibility map this would have been a rather daunting
project, but I believe this could piggyback on the VM code rather
nicely. Anytime you clear the VM you clearly must clear the FrozenMap as
well. The logic for setting the FM is clearly different, but that would
be entirely self-contained to vacuum. Unlike the VM, I don't see any
point to marking special bits in the page itself for FM.
It would be nice if each bit in the FM covered multiple pages, but that
can be optimized later.
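For what it's worth, the FM could simply copy the visibility map's
addressing scheme, one bit per heap page. The macros below mirror the ones
in visibilitymap.c and are only a sketch of what a frozenmap.c might
contain.

/* Size of the bitmap on each FM page, in bytes */
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))

/* Number of heap blocks covered by one byte / one page of the FM */
#define HEAPBLOCKS_PER_BYTE		8
#define HEAPBLOCKS_PER_PAGE		(MAPSIZE * HEAPBLOCKS_PER_BYTE)

/* Mapping from a heap block number to the right bit in the FM */
#define HEAPBLK_TO_MAPBLOCK(x)	((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x)	(((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x)	((x) % HEAPBLOCKS_PER_BYTE)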
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Sat, Apr 4, 2015 at 3:10 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> This doesn't seem safe enough to me. Can't there be tuples that are still
> new enough that they can't be frozen, and are still live?
Yep. I've set a table to read only while it contained unfreezable tuples,
and the tuples remain unfrozen yet the read-only action claims to have
succeeded.
> Somewhat related... instead of forcing the freeze to happen synchronously,
> can't we set this up so a table is in one of three states? Read/Write, Read
> Only, Frozen. AT_SetReadOnly and AT_SetReadWrite would simply change to the
> appropriate state, and all the vacuum infrastructure would continue to
> process those tables as it does today.
+1 here as well. I might want to set tables to read only for reasons other
than avoiding repeated freezing (after all, the name of the command suggests
it is a general-purpose thing), and wouldn't want to automatically trigger a
vacuum freeze in the process.
Cheers,
Jeff
On Sun, Apr 5, 2015 at 8:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
Thank you for comments.
> Somewhat related... instead of forcing the freeze to happen synchronously,
> can't we set this up so a table is in one of three states? Read/Write, Read
> Only, Frozen.
I agree with the three states: Read/Write, ReadOnly and Frozen.
But I'm not sure when we should freeze the tuples, i.e., scan the whole table.
My thinking was that any change to a table marked as ReadOnly is completely
ignored/restricted, and that marking it ReadOnly is accompanied by freezing
its tuples.
A Frozen table ensures that all of its tuples have been completely frozen,
so that also needs a whole-table scan.
That is, we would need to scan the whole table twice, right?
> There is another possibility here, too. We can completely divorce a
> ReadOnly mode (which I think is useful for other things besides freezing)
> from the question of whether we need to force-freeze a relation if we
> create a FrozenMap, similar to the visibility map.
I was actually thinking of this idea (an FM) as a way to avoid freezing all
tuples. As you said, it might not be a good idea (or it may be overkill) for
avoiding repeated freezing to be the reason a table is set to read-only.
I'm going to try to design an FM to avoid freezing relations as well.
Is it enough for each FM bit to record that the corresponding page is
completely frozen?
Regards,
-------
Sawada Masahiko
On 4/6/15 1:46 AM, Sawada Masahiko wrote:
> That is, we would need to scan the whole table twice, right?
No. You would be free to set a table as ReadOnly any time you wanted,
without scanning anything. All that setting does is disable any DML on
the table.
The Frozen state would only be set by the vacuum code, IFF:
- The table state is ReadOnly *at the start of vacuum* and did not
change during vacuum
- Vacuum ensured that there were no un-frozen tuples in the table
That does not necessitate 2 scans.
> Is it enough for each FM bit to record that the corresponding page is
> completely frozen?
If I'm understanding your implied question correctly, I don't think
there would actually be any relationship between FM and marking
ReadOnly. It would come into play if we wanted to do the Frozen state,
but if we have the FM, marking an entire relation as Frozen becomes a
lot less useful. What's going to happen with a VACUUM FREEZE once we
have FM is that vacuum will be able to skip reading pages if they are
all-visible *and* the FM shows them as frozen, whereas today we can't
use the VM to skip pages if scan_all is true.
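In other words, the skip test in lazy_scan_heap() might end up looking
roughly like this. It's only a sketch: frozenmap_test() and the helper
itself are hypothetical, while visibilitymap_test() is the existing VM
probe.

static bool
lazy_can_skip_block(Relation onerel, BlockNumber blkno, bool scan_all,
					Buffer *vmbuffer, Buffer *fmbuffer)
{
	/* A page that may hold dead or invisible tuples always has to be read. */
	if (!visibilitymap_test(onerel, blkno, vmbuffer))
		return false;

	/*
	 * An aggressive (scan_all) vacuum must visit every all-visible page
	 * today; with a FrozenMap it could skip the ones already all-frozen.
	 */
	if (scan_all)
		return frozenmap_test(onerel, blkno, fmbuffer);	/* hypothetical */

	return true;
}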
For simplicity, I would start out with each FM bit representing a single
page. That means the FM would be very similar in operation to the VM;
the only difference would be when a bit in the FM was set. I would
absolutely split this into 2 patches as well; one for ReadOnly (and skip
the Frozen status for now), and one for FM.
When I looked at the VM code briefly it occurred to me that it might be
quite difficult to have 1 FM bit represent multiple pages. The issue is
the locking necessary between VACUUM and clearing a FM bit. In the VM
that's handled by the cleanup lock, but that will only work at a page
level. We'd need something to ensure that nothing came in and performed
DML while the vacuum code was getting ready to set a FM bit. There's
probably several ways this could be accomplished, but I think it would
be foolish to try and do anything about it in the initial patch.
Especially because it's only supposition that there would be much
benefit to having multiple pages per bit.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Mon, Apr 6, 2015 at 10:17 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> No. You would be free to set a table as ReadOnly any time you wanted,
> without scanning anything. All that setting does is disable any DML on
> the table.
>
> The Frozen state would only be set by the vacuum code, IFF:
> - The table state is ReadOnly *at the start of vacuum* and did not
>   change during vacuum
> - Vacuum ensured that there were no un-frozen tuples in the table
>
> That does not necessitate 2 scans.
I understand this concept, and I have a question, as written below.
> For simplicity, I would start out with each FM bit representing a single
> page. That means the FM would be very similar in operation to the VM; the
> only difference would be when a bit in the FM was set. I would absolutely
> split this into 2 patches as well; one for ReadOnly (and skip the Frozen
> status for now), and one for FM.
Yes, I will separate the patch into two patches.
I'd like to confirm whether my thinking here is correct.
In the first patch, each FM bit represents a single page and indicates
whether all tuples on that page have been completely frozen.
The second patch adds the three states and the read-only table, which
disallows any write to the table. The trigger that changes the state from
Read/Write to Read-Only is ALTER TABLE SET READ ONLY. The trigger that
changes it from Read-Only to Frozen is vacuum, and only when the table was
already marked Read-Only when the vacuum started *and* the vacuum did not
freeze any tuple (including when it skipped pages according to the FM). If
we have the FM, we can avoid repeatedly freezing the whole table even if
the table has not been marked Read-Only.
In order to change the state to Frozen, we need to run VACUUM FREEZE or
wait for autovacuum. Generally, the cutoff-xid threshold differs between
VACUUM (and autovacuum) and VACUUM FREEZE, so we would not expect a plain
explicit vacuum or autovacuum to change the state. Inevitably, we would
need both ALTER TABLE SET READ ONLY and VACUUM FREEZE to reach the Frozen
state.
I think we should also add DDL that both freezes the tuples and changes the
state in one pass, such as ALTER TABLE SET READ ONLY WITH FREEZE or
ALTER TABLE SET FROZEN.
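For the pg_class side of the second patch, I am imagining something like
the following (only a sketch, modeled on relpersistence; the posted patch
still has a plain bool relreadonly):

/* hypothetical replacement for the bool relreadonly column */
#define		RELREADONLY_READ_WRITE	'w'	/* normal, writable table */
#define		RELREADONLY_READ_ONLY	'r'	/* DML rejected, not yet verified frozen */
#define		RELREADONLY_FROZEN		'f'	/* DML rejected, vacuum verified all tuples frozen */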
Regards,
-------
Sawada Masahiko
On 4/6/15 11:12 AM, Sawada Masahiko wrote:
> In the first patch, each FM bit represents a single page and indicates
> whether all tuples on that page have been completely frozen.
Yes.
> The second patch adds the three states and the read-only table, which
> disallows any write to the table.
Actually, I would start simply with ReadOnly and ReadWrite.
As I understand it, the goal here is to prevent huge amounts of periodic
freeze work due to XID wraparound. I don't think we need the Freeze
state to accomplish that.
With a single bit per page in the Frozen Map, checking a 800GB table
would require reading a mere 100MB of FM. That's pretty tiny, and
largely accomplishes the goal.
Obviously it would be nice to eliminate even that 100MB read, but I
suggest you leave that for a 3rd patch. I think you'll find that just
getting the first 2 accomplished will be a significant amount of work.
Also, note that you don't really even need the ReadOnly patch. As long
as you're not actually touching the table at all the FM will eventually
read as everything is frozen; that gets you 80% of the way there. So I'd
suggest starting with the FM, then doing ReadOnly, and only then
attempting to add the Frozen state.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Mon, Apr 06, 2015 at 12:07:47PM -0500, Jim Nasby wrote:
> ...
> As I understand it, the goal here is to prevent huge amounts of
> periodic freeze work due to XID wraparound. I don't think we need
> the Freeze state to accomplish that.
>
> With a single bit per page in the Frozen Map, checking a 800GB table
> would require reading a mere 100MB of FM. That's pretty tiny, and
> largely accomplishes the goal.
>
> Obviously it would be nice to eliminate even that 100MB read, but I
> suggest you leave that for a 3rd patch. I think you'll find that
> just getting the first 2 accomplished will be a significant amount
> of work.
Hi,
I may have my math wrong, but 800GB ~ 100M pages or 12.5MB and not
100MB.
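Spelled out, assuming the standard 8 KB block size and one FM bit per heap
page (throwaway arithmetic only):

#include <stdio.h>

int
main(void)
{
	const long long table_bytes = 800LL * 1024 * 1024 * 1024;	/* 800 GB table */
	const long long heap_pages = table_bytes / (8 * 1024);		/* ~100 million pages */
	const long long fm_bytes = heap_pages / 8;			/* one bit per page */

	printf("%lld pages, %.1f MB of frozen map\n",
		   heap_pages, fm_bytes / (1024.0 * 1024.0));		/* ~12.5 MB */
	return 0;
}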
Regards,
Ken
On 4/6/15 12:29 PM, ktm@rice.edu wrote:
> I may have my math wrong, but 800GB ~ 100M pages or 12.5MB and not
> 100MB.
Doh! 8 bits per byte and all that...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 04/06/2015 10:07 AM, Jim Nasby wrote:
Actually, I would start simply with ReadOnly and ReadWrite.
As I understand it, the goal here is to prevent huge amounts of periodic
freeze work due to XID wraparound. I don't think we need the Freeze
state to accomplish that.
With a single bit per page in the Frozen Map, checking a 800GB table
would require reading a mere 100MB of FM. That's pretty tiny, and
largely accomplishes the goal.
Obviously it would be nice to eliminate even that 100MB read, but I
suggest you leave that for a 3rd patch. I think you'll find that just
getting the first 2 accomplished will be a significant amount of work.
Also, note that you don't really even need the ReadOnly patch. As long
as you're not actually touching the table at all the FM will eventually
read as everything is frozen; that gets you 80% of the way there. So I'd
suggest starting with the FM, then doing ReadOnly, and only then
attempting to add the Frozen state.
+1
There was some reason why we didn't have Freeze Map before, though;
IIRC these were the problems:
1. would need to make sure it gets sync'd to disk and/or WAL-logged
2. every time a page is modified, the map would need to get updated
3. Yet Another Relation File (not inconsequential for the cases we're
discussing).
Also, given that the Visibility Map is necessarily a superset of the
Frozen Map, maybe combining them in some way would make sense.
I agree with Jim that if we have a trustworthy Frozen Map, having a
ReadOnly flag is of marginal value, unless such a ReadOnly flag allowed
us to skip updating the individual row XIDs entirely. I can think of
some ways to do that, but they have severe tradeoffs.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus wrote:
I agree with Jim that if we have a trustworthy Frozen Map, having a
ReadOnly flag is of marginal value, unless such a ReadOnly flag allowed
us to skip updating the individual row XIDs entirely. I can think of
some ways to do that, but they have severe tradeoffs.
If you're thinking that the READ ONLY flag is only useful for freezing,
then yeah maybe it's of marginal value. But in the foreign key
constraint area, consider that you could have tables with
frequently-referenced PKs marked as READ ONLY -- then you don't need to
acquire row locks when inserting/updating rows in the referencing
tables. That might give you a good performance benefit that's not in
any way related to freezing, as well as reducing your multixact
consumption rate.
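(As a concrete example: an INSERT into a referencing table normally makes
the RI check take a FOR KEY SHARE row lock on the referenced PK row, which
is where much of the multixact traffic comes from; with the PK table
guaranteed READ ONLY, that row lock could presumably be skipped entirely.)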
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 04/06/2015 11:35 AM, Alvaro Herrera wrote:
Josh Berkus wrote:
I agree with Jim that if we have a trustworthy Frozen Map, having a
ReadOnly flag is of marginal value, unless such a ReadOnly flag allowed
us to skip updating the individual row XIDs entirely. I can think of
some ways to do that, but they have severe tradeoffs.
If you're thinking that the READ ONLY flag is only useful for freezing,
then yeah maybe it's of marginal value. But in the foreign key
constraint area, consider that you could have tables with
frequently-referenced PKs marked as READ ONLY -- then you don't need to
acquire row locks when inserting/updating rows in the referencing
tables. That might give you a good performance benefit that's not in
any way related to freezing, as well as reducing your multixact
consumption rate.
Hmmmm. Yeah, that would make it worthwhile, although it would be a
fairly obscure bit of performance optimization for anyone not on this
list ;-)
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 4/6/15 1:28 PM, Josh Berkus wrote:
On 04/06/2015 10:07 AM, Jim Nasby wrote:
Actually, I would start simply with ReadOnly and ReadWrite.
As I understand it, the goal here is to prevent huge amounts of periodic
freeze work due to XID wraparound. I don't think we need the Freeze
state to accomplish that.
With a single bit per page in the Frozen Map, checking a 800GB table
would require reading a mere 100MB of FM. That's pretty tiny, and
largely accomplishes the goal.
Obviously it would be nice to eliminate even that 100MB read, but I
suggest you leave that for a 3rd patch. I think you'll find that just
getting the first 2 accomplished will be a significant amount of work.
Also, note that you don't really even need the ReadOnly patch. As long
as you're not actually touching the table at all the FM will eventually
read as everything is frozen; that gets you 80% of the way there. So I'd
suggest starting with the FM, then doing ReadOnly, and only then
attempting to add the Frozen state.
+1
There was some reason why we didn't have Freeze Map before, though;
IIRC these were the problems:
1. would need to make sure it gets sync'd to disk and/or WAL-logged
Same as VM.
2. every time a page is modified, the map would need to get updated
Not every time, just the first time, if the FM bit for the page was set. It
would only be set by vacuum, just like the VM.
3. Yet Another Relation File (not inconsequential for the cases we're
discussing).
Sure, which is why I think it might be interesting to either allow for
more than one page per bit, or perhaps some form of compression. That
said, I don't think it's worth worrying about too much because it's
still a 64,000-1 ratio with 8k pages. If you use 32k pages it becomes
256,000-1, or 4GB of FM for 1PB of heap.
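(Spelling out the arithmetic: one FM byte covers 8 heap pages, so with 8k
pages each byte of FM maps 64KB of heap, roughly 64,000-1; with 32k pages
each byte maps 256KB, roughly 256,000-1, which is about 4GB of FM for 1PB
of heap.)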
Also, given that the Visibility Map necessarily needs to have the
superset of the Frozen Map, maybe combining them in some way would make
sense.
The thing is, I think in many workloads the patterns here will actually
be radically different, in that it's way easier to get a page to be
all-visible than it is to freeze it.
Perhaps there's something we can do here when we look at other ways to
reduce space usage for FM (and maybe VM too), but I don't think now is
the time to put effort into this.
I agree with Jim that if we have a trustworthy Frozen Map, having a
ReadOnly flag is of marginal value, unless such a ReadOnly flag allowed
us to skip updating the individual row XIDs entirely. I can think of
some ways to do that, but they have severe tradeoffs.
Aside from Alvaro's points, I think many users would find it useful as
an easy way to ensure no one is writing to a table, which could be
valuable for any number of reasons. As long as the patch isn't too
complicated I don't see a reason not to do it.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 6 Apr 2015 09:17, "Jim Nasby" <Jim.Nasby@bluetreble.com> wrote:
No. You would be free to set a table as ReadOnly any time you wanted,
without scanning anything. All that setting does is disable any DML on the
table.
The Frozen state would only be set by the vacuum code, IFF:
- The table state is ReadOnly *at the start of vacuum* and did not change
during vacuum
- Vacuum ensured that there were no un-frozen tuples in the table
That does not necessitate 2 scans.
This is exactly what I would suggest.
Only I would suggest thinking of it in terms of two orthogonal boolean
flags rather than three states. It's easier to reason about whether a table
has a specific property than trying to control a state machine in a
predefined pathway.
So I would say the two flags are:
READONLY: guarantees nothing can be dirtied
ALLFROZEN: guarantees no unfrozen tuples are present
In practice you can't have the latter without the former, since vacuum can't
know everything is frozen unless it knows nobody is inserting. But perhaps
there will be cases in the future where that's not true.
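A rough sketch of the idea (the pg_class column names below are hypothetical,
just to illustrate the mapping):

	/* hypothetical pg_class columns, for illustration only */
	bool		relreadonly;	/* nothing in the table can be dirtied */
	bool		relallfrozen;	/* no unfrozen tuples are present */

	/*
	 * The three-state model maps onto these as:
	 *   Read/Write -> (relreadonly = false, relallfrozen = false)
	 *   Read-Only  -> (true,  false)
	 *   Frozen     -> (true,  true)
	 * (false, true) is unreachable today, since vacuum can only prove
	 * "everything is frozen" while nobody can insert.
	 */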
Incidentally there are a number of other optimisations that I've had in mind
that are only possible on frozen read-only tables:
1) Compression: compress the pages and pack them one after the other. Build
a new fork with offsets for each page.
2) Automatic partition elimination where the statistics track the minimum
and maximum value per partition (and number of tuples) and treat them as
implicit constraints. In particular it would magically make read-only empty
parent partitions be excluded regardless of the where clause.
On 4/6/15 5:18 PM, Greg Stark wrote:
Only I would suggest thinking of it in terms of two orthogonal boolean
flags rather than three states. It's easier to reason about whether a
table has a specific property than trying to control a state machine in
a predefined pathway.
So I would say the two flags are:
READONLY: guarantees nothing can be dirtied
ALLFROZEN: guarantees no unfrozen tuples are present
In practice you can't have the later without the former since vacuum
can't know everything is frozen unless it knows nobody is inserting. But
perhaps there will be cases in the future where that's not true.
I'm not so sure about that. There's a logical state progression here
(see below). ISTM it's easier to just enforce that in one place instead
of a bunch of places having to check multiple conditions. But, I'm not
wed to a single field.
Incidentally there are number of other optimisations tat over had in
mind that are only possible on frozen read-only tables:
1) Compression: compress the pages and pack them one after the other.
Build a new fork with offsets for each page.
2) Automatic partition elimination where the statistics track the
minimum and maximum value per partition (and number of tuples) and treat
then as implicit constraints. In particular it would magically make read
only empty parent partitions be excluded regardless of the where clause.
AFAICT neither of those actually requires ALLFROZEN, no? You'll need to
uncompact and re-compact for #1 when you actually freeze (which maybe
isn't worth it), but freezing isn't absolutely required. #2 would only
require that everything in the relation is visible; not frozen.
I think there's value here to having an ALLVISIBLE state as well as
ALLFROZEN.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Apr 7, 2015 at 7:53 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 4/6/15 5:18 PM, Greg Stark wrote:
Only I would suggest thinking of it in terms of two orthogonal boolean
flags rather than three states. It's easier to reason about whether a
table has a specific property than trying to control a state machine in
a predefined pathway.
So I would say the two flags are:
READONLY: guarantees nothing can be dirtied
ALLFROZEN: guarantees no unfrozen tuples are present
In practice you can't have the later without the former since vacuum
can't know everything is frozen unless it knows nobody is inserting. But
perhaps there will be cases in the future where that's not true.
I'm not so sure about that. There's a logical state progression here (see
below). ISTM it's easier to just enforce that in one place instead of a
bunch of places having to check multiple conditions. But, I'm not wed to a
single field.
Incidentally there are number of other optimisations tat over had in
mind that are only possible on frozen read-only tables:
1) Compression: compress the pages and pack them one after the other.
Build a new fork with offsets for each page.
2) Automatic partition elimination where the statistics track the
minimum and maximum value per partition (and number of tuples) and treat
then as implicit constraints. In particular it would magically make read
only empty parent partitions be excluded regardless of the where clause.
AFAICT neither of those actually requires ALLFROZEN, no? You'll need to
uncompact and re-compact for #1 when you actually freeze (which maybe isn't
worth it), but freezing isn't absolutely required. #2 would only require
that everything in the relation is visible; not frozen.
I think there's value here to having an ALLVISIBLE state as well as
ALLFROZEN.
Based on these suggestions, I'm going to deal with the FM first, as one
patch. In this first patch it would be a simple mechanism, similar to the VM:
- Each bit of the FM represents a single heap page
- A bit is set only by vacuum
- A bit is cleared by INSERT, UPDATE and DELETE
Second, I'll deal with a simple read-only table with two states,
Read/Write (the default) and Read-Only, as another patch. ISTM that having
the Frozen state needs more discussion. A read-only table just allows us
to disable any writes to the table, controlled by a read-only flag in
pg_class. The DDL commands which change this status would be ALTER
TABLE SET READ ONLY and ALTER TABLE SET READ WRITE.
Also, as Alvaro suggested, a read-only table helps not only with freezing
but also with other performance optimizations. I'll consider
including them when I work on the read-only table.
Regards,
-------
Sawada Masahiko
On Tue, Apr 7, 2015 at 11:22 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Tue, Apr 7, 2015 at 7:53 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 4/6/15 5:18 PM, Greg Stark wrote:
Only I would suggest thinking of it in terms of two orthogonal boolean
flags rather than three states. It's easier to reason about whether a
table has a specific property than trying to control a state machine in
a predefined pathway.
So I would say the two flags are:
READONLY: guarantees nothing can be dirtied
ALLFROZEN: guarantees no unfrozen tuples are present
In practice you can't have the later without the former since vacuum
can't know everything is frozen unless it knows nobody is inserting. But
perhaps there will be cases in the future where that's not true.
I'm not so sure about that. There's a logical state progression here (see
below). ISTM it's easier to just enforce that in one place instead of a
bunch of places having to check multiple conditions. But, I'm not wed to a
single field.
Incidentally there are number of other optimisations tat over had in
mind that are only possible on frozen read-only tables:
1) Compression: compress the pages and pack them one after the other.
Build a new fork with offsets for each page.
2) Automatic partition elimination where the statistics track the
minimum and maximum value per partition (and number of tuples) and treat
then as implicit constraints. In particular it would magically make read
only empty parent partitions be excluded regardless of the where clause.
AFAICT neither of those actually requires ALLFROZEN, no? You'll need to
uncompact and re-compact for #1 when you actually freeze (which maybe isn't
worth it), but freezing isn't absolutely required. #2 would only require
that everything in the relation is visible; not frozen.
I think there's value here to having an ALLVISIBLE state as well as
ALLFROZEN.
Based on may suggestions, I'm going to deal with FM at first as one
patch. It would be simply mechanism and similar to VM, at first patch.
- Each bit of FM represent single page
- The bit is set only by vacuum
- The bit is un-set by inserting and updating and deleting
At second, I'll deal with simply read-only table and 2 states,
Read/Write(default) and ReadOnly as one patch. ITSM the having the
Frozen state needs to more discussion. read-only table just allow us
to disable any updating table, and it's controlled by read-only flag
pg_class has. And DDL command which changes these status is like ALTER
TABLE SET READ ONLY, or READ WRITE.
Also as Alvaro's suggested, the read-only table affect not only
freezing table but also performance optimization. I'll consider
including them when I deal with read-only table.
The attached WIP patch adds a Frozen Map, which enables us to avoid
scanning the whole table even when a full scan would otherwise be
required to prevent XID wraparound failures.
The Frozen Map is a bitmap with one bit per heap page, quite similar to
the Visibility Map. A set bit means that all tuples on the corresponding
heap page are completely frozen, so vacuum does not need to freeze that
page again.
A bit is set when vacuum (or autovacuum) finds that all tuples on the
corresponding heap page are completely frozen, and a bit is cleared when
an INSERT or an UPDATE (on the new heap page) is executed.
The current patch adds a new source file, src/backend/access/heap/frozenmap.c,
which is quite similar to visibilitymap.c. They have similar code but
are kept separate for now. I can refactor this code, e.g. into a common
bitmap.c, if needed.
Also, when skipping pages during vacuum based on the visibility map, we
only skip runs of at least SKIP_PAGES_THRESHOLD consecutive pages, but
there is no such mechanism for the frozen map yet.
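As a rough illustration (this is not part of the attached patch), a vacuum
scan could consult the map through the frozenmap_test() API the patch adds
and skip pages whose bit is set; onerel and nblocks below simply stand for
vacuum's target relation and its block count:

	/* sketch only; not taken from the patch */
	Buffer		fmbuffer = InvalidBuffer;
	BlockNumber	blkno;

	for (blkno = 0; blkno < nblocks; blkno++)
	{
		/* All tuples on this page are already frozen; nothing to do. */
		if (frozenmap_test(onerel, blkno, &fmbuffer))
			continue;

		/* ... otherwise scan the page and freeze tuples as usual ... */
	}

	if (BufferIsValid(fmbuffer))
		ReleaseBuffer(fmbuffer);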
Please give me feedback.
Regards,
-------
Sawada Masahiko
Attachments:
000_frozenmap_WIP.patchtext/x-diff; charset=US-ASCII; name=000_frozenmap_WIP.patchDownload
diff --git a/src/backend/access/heap/Makefile b/src/backend/access/heap/Makefile
index b83d496..53f07fd 100644
--- a/src/backend/access/heap/Makefile
+++ b/src/backend/access/heap/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/heap
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o
+OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o frozenmap.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/frozenmap.c b/src/backend/access/heap/frozenmap.c
new file mode 100644
index 0000000..6e64cb8
--- /dev/null
+++ b/src/backend/access/heap/frozenmap.c
@@ -0,0 +1,567 @@
+/*-------------------------------------------------------------------------
+ *
+ * frozenmap.c
+ * bitmap for tracking frozen heap tuples
+ *
+ * Portions Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/heap/frozenmap.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/frozenmap.h"
+#include "access/heapam_xlog.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/lmgr.h"
+#include "storage/smgr.h"
+#include "utils/inval.h"
+
+
+//#define TRACE_FROZENMAP
+
+/*
+ * Size of the bitmap on each frozen map page, in bytes. There's no
+ * extra headers, so the whole page minus the standard page header is
+ * used for the bitmap.
+ */
+#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
+
+/* Number of bits allocated for each heap block. */
+#define BITS_PER_HEAPBLOCK 1
+
+/* Number of heap blocks we can represent in one byte. */
+#define HEAPBLOCKS_PER_BYTE 8
+
+/* Number of heap blocks we can represent in one frozen map page. */
+#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
+
+/* Mapping from heap block number to the right bit in the frozen map */
+#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
+#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
+#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
+
+/* table for fast counting of set bits */
+static const uint8 number_of_ones[256] = {
+ 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
+ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+ 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+};
+
+/* prototypes for internal routines */
+static Buffer fm_readbuf(Relation rel, BlockNumber blkno, bool extend);
+static void fm_extend(Relation rel, BlockNumber nfmblocks);
+
+
+/*
+ * frozenmap_clear - clear a bit in frozen map
+ *
+ * This function is same logic as visibilitymap_clear.
+ * You must pass a buffer containing the correct map page to this function.
+ * Call frozenmap_pin first to pin the right one. This function doesn't do
+ * any I/O.
+ */
+void
+frozenmap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
+{
+ BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+ int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+ int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+ uint8 mask = 1 << mapBit;
+ char *map;
+
+#ifdef TRACE_FROZENMAP
+ elog(DEBUG1, "fm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+#endif
+
+ if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
+ elog(ERROR, "wrong buffer passed to frozenmap_clear");
+
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ map = PageGetContents(BufferGetPage(buf));
+
+ if (map[mapByte] & mask)
+ {
+ map[mapByte] &= ~mask;
+
+ MarkBufferDirty(buf);
+ }
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+}
+
+/*
+ * frozenmap_pin - pin a map page for setting a bit
+ *
+ * This function is same logic as visibilitymap_pin.
+ * Setting a bit in the frozen map is a two-phase operation. First, call
+ * frozenmap_pin, to pin the frozen map page containing the bit for
+ * the heap page. Because that can require I/O to read the map page, you
+ * shouldn't hold a lock on the heap page while doing that. Then, call
+ * frozenmap_set to actually set the bit.
+ *
+ * On entry, *buf should be InvalidBuffer or a valid buffer returned by
+ * an earlier call to frozenmap_pin or frozenmap_test on the same
+ * relation. On return, *buf is a valid buffer with the map page containing
+ * the bit for heapBlk.
+ *
+ * If the page doesn't exist in the map file yet, it is extended.
+ */
+void
+frozenmap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
+{
+ BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+
+ /* Reuse the old pinned buffer if possible */
+ if (BufferIsValid(*buf))
+ {
+ if (BufferGetBlockNumber(*buf) == mapBlock)
+ return;
+
+ ReleaseBuffer(*buf);
+ }
+ *buf = fm_readbuf(rel, mapBlock, true);
+}
+
+/*
+ * frozenmap_pin_ok - do we already have the correct page pinned?
+ *
+ * On entry, buf should be InvalidBuffer or a valid buffer returned by
+ * an earlier call to frozenmap_pin or frozenmap_test on the same
+ * relation. The return value indicates whether the buffer covers the
+ * given heapBlk.
+ */
+bool
+frozenmap_pin_ok(BlockNumber heapBlk, Buffer buf)
+{
+ BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+
+ return BufferIsValid(buf) && BufferGetBlockNumber(buf) == mapBlock;
+}
+
+/*
+ * frozenmap_set - set a bit on a previously pinned page
+ *
+ * recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
+ * or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
+ * one provided; in normal running, we generate a new XLOG record and set the
+ * page LSN to that value. cutoff_xid is the largest xmin on the page being
+ * marked all-frozen; it is needed for Hot Standby, and can be
+ * InvalidTransactionId if the page contains no tuples.
+ *
+ * Caller is expected to set the heap page's PD_ALL_FROZEN bit before calling
+ * this function. Except in recovery, caller should also pass the heap
+ * buffer. When checksums are enabled and we're not in recovery, we must add
+ * the heap buffer to the WAL chain to protect it from being torn.
+ *
+ * You must pass a buffer containing the correct map page to this function.
+ * Call frozenmap_pin first to pin the right one. This function doesn't do
+ * any I/O.
+ */
+void
+frozenmap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
+ XLogRecPtr recptr, Buffer fmBuf)
+{
+ BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+ uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+ uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+ Page page;
+ char *map;
+
+#ifdef TRACE_FROZENMAP
+ elog(DEBUG1, "fm_set %s %d", RelationGetRelationName(rel), heapBlk);
+#endif
+
+ Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
+ Assert(InRecovery || BufferIsValid(heapBuf));
+
+ /* Check that we have the right heap page pinned, if present */
+ if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
+ elog(ERROR, "wrong heap buffer passed to frozenmap_set");
+
+ /* Check that we have the right FM page pinned */
+ if (!BufferIsValid(fmBuf) || BufferGetBlockNumber(fmBuf) != mapBlock)
+ elog(ERROR, "wrong FM buffer passed to frozenmap_set");
+
+ page = BufferGetPage(fmBuf);
+ map = PageGetContents(page);
+ LockBuffer(fmBuf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (!(map[mapByte] & (1 << mapBit)))
+ {
+ START_CRIT_SECTION();
+
+ map[mapByte] |= (1 << mapBit);
+ MarkBufferDirty(fmBuf);
+
+ if (RelationNeedsWAL(rel))
+ {
+ if (XLogRecPtrIsInvalid(recptr))
+ {
+ Assert(!InRecovery);
+ recptr = log_heap_frozenmap(rel->rd_node, heapBuf, fmBuf);
+
+ /*
+ * If data checksums are enabled (or wal_log_hints=on), we
+ * need to protect the heap page from being torn.
+ */
+ if (XLogHintBitIsNeeded())
+ {
+ Page heapPage = BufferGetPage(heapBuf);
+
+ /* caller is expected to set PD_ALL_FROZEN first */
+ Assert(PageIsAllFrozen(heapPage));
+ PageSetLSN(heapPage, recptr);
+ }
+ }
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+ }
+
+ LockBuffer(fmBuf, BUFFER_LOCK_UNLOCK);
+}
+
+/*
+ * frozenmap_test - test if a bit is set
+ *
+ * Are all tuples on heapBlk frozen, according to the frozen map?
+ *
+ * On entry, *buf should be InvalidBuffer or a valid buffer returned by an
+ * earlier call to frozenmap_pin or frozenmap_test on the same
+ * relation. On return, *buf is a valid buffer with the map page containing
+ * the bit for heapBlk, or InvalidBuffer. The caller is responsible for
+ * releasing *buf after it's done testing and setting bits.
+ *
+ * NOTE: This function is typically called without a lock on the heap page,
+ * so somebody else could change the bit just after we look at it. In fact,
+ * since we don't lock the frozen map page either, it's even possible that
+ * someone else could have changed the bit just before we look at it, but yet
+ * we might see the old value. It is the caller's responsibility to deal with
+ * all concurrency issues!
+ */
+bool
+frozenmap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+{
+ BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+ uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+ uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+ bool result;
+ char *map;
+
+#ifdef TRACE_FROZENMAP
+ elog(DEBUG1, "fm_test %s %d", RelationGetRelationName(rel), heapBlk);
+#endif
+
+ /* Reuse the old pinned buffer if possible */
+ if (BufferIsValid(*buf))
+ {
+ if (BufferGetBlockNumber(*buf) != mapBlock)
+ {
+ ReleaseBuffer(*buf);
+ *buf = InvalidBuffer;
+ }
+ }
+
+ if (!BufferIsValid(*buf))
+ {
+ *buf = fm_readbuf(rel, mapBlock, false);
+ if (!BufferIsValid(*buf))
+ return false;
+ }
+
+ map = PageGetContents(BufferGetPage(*buf));
+
+ /*
+ * A single-bit read is atomic. There could be memory-ordering effects
+ * here, but for performance reasons we make it the caller's job to worry
+ * about that.
+ */
+ result = (map[mapByte] & (1 << mapBit)) ? true : false;
+
+ return result;
+}
+
+/*
+ * frozenmap_count - count number of bits set in frozen map
+ *
+ * Note: we ignore the possibility of race conditions when the table is being
+ * extended concurrently with the call. New pages added to the table aren't
+ * going to be marked all-frozen, so they won't affect the result.
+ */
+BlockNumber
+frozenmap_count(Relation rel)
+{
+ BlockNumber result = 0;
+ BlockNumber mapBlock;
+
+ for (mapBlock = 0;; mapBlock++)
+ {
+ Buffer mapBuffer;
+ unsigned char *map;
+ int i;
+
+ /*
+ * Read till we fall off the end of the map. We assume that any extra
+ * bytes in the last page are zeroed, so we don't bother excluding
+ * them from the count.
+ */
+ mapBuffer = fm_readbuf(rel, mapBlock, false);
+ if (!BufferIsValid(mapBuffer))
+ break;
+
+ /*
+ * We choose not to lock the page, since the result is going to be
+ * immediately stale anyway if anyone is concurrently setting or
+ * clearing bits, and we only really need an approximate value.
+ */
+ map = (unsigned char *) PageGetContents(BufferGetPage(mapBuffer));
+
+ for (i = 0; i < MAPSIZE; i++)
+ {
+ result += number_of_ones[map[i]];
+ }
+
+ ReleaseBuffer(mapBuffer);
+ }
+
+ return result;
+}
+
+/*
+ * frozenmap_truncate - truncate the frozen map
+ *
+ * The caller must hold AccessExclusiveLock on the relation, to ensure that
+ * other backends receive the smgr invalidation event that this function sends
+ * before they access the FM again.
+ *
+ * nheapblocks is the new size of the heap.
+ */
+void
+frozenmap_truncate(Relation rel, BlockNumber nheapblocks)
+{
+ BlockNumber newnblocks;
+
+ /* last remaining block, byte, and bit */
+ BlockNumber truncBlock = HEAPBLK_TO_MAPBLOCK(nheapblocks);
+ uint32 truncByte = HEAPBLK_TO_MAPBYTE(nheapblocks);
+ uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
+
+#ifdef TRACE_FROZENMAP
+ elog(DEBUG1, "fm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+#endif
+
+ RelationOpenSmgr(rel);
+
+ /*
+ * If no frozen map has been created yet for this relation, there's
+ * nothing to truncate.
+ */
+ if (!smgrexists(rel->rd_smgr, FROZENMAP_FORKNUM))
+ return;
+
+ /*
+ * Unless the new size is exactly at a frozen map page boundary, the
+ * tail bits in the last remaining map page, representing truncated heap
+ * blocks, need to be cleared. This is not only tidy, but also necessary
+ * because we don't get a chance to clear the bits if the heap is extended
+ * again.
+ */
+ if (truncByte != 0 || truncBit != 0)
+ {
+ Buffer mapBuffer;
+ Page page;
+ char *map;
+
+ newnblocks = truncBlock + 1;
+
+ mapBuffer = fm_readbuf(rel, truncBlock, false);
+ if (!BufferIsValid(mapBuffer))
+ {
+ /* nothing to do, the file was already smaller */
+ return;
+ }
+
+ page = BufferGetPage(mapBuffer);
+ map = PageGetContents(page);
+
+ LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+ /* Clear out the unwanted bytes. */
+ MemSet(&map[truncByte + 1], 0, MAPSIZE - (truncByte + 1));
+
+ /*----
+ * Mask out the unwanted bits of the last remaining byte.
+ *
+ * ((1 << 0) - 1) = 00000000
+ * ((1 << 1) - 1) = 00000001
+ * ...
+ * ((1 << 6) - 1) = 00111111
+ * ((1 << 7) - 1) = 01111111
+ *----
+ */
+ map[truncByte] &= (1 << truncBit) - 1;
+
+ MarkBufferDirty(mapBuffer);
+ UnlockReleaseBuffer(mapBuffer);
+ }
+ else
+ newnblocks = truncBlock;
+
+ if (smgrnblocks(rel->rd_smgr, FROZENMAP_FORKNUM) <= newnblocks)
+ {
+ /* nothing to do, the file was already smaller than requested size */
+ return;
+ }
+
+ /* Truncate the unused FM pages, and send smgr inval message */
+ smgrtruncate(rel->rd_smgr, FROZENMAP_FORKNUM, newnblocks);
+
+ /*
+ * We might as well update the local smgr_fm_nblocks setting. smgrtruncate
+ * sent an smgr cache inval message, which will cause other backends to
+ * invalidate their copy of smgr_fm_nblocks, and this one too at the next
+ * command boundary. But this ensures it isn't outright wrong until then.
+ */
+ if (rel->rd_smgr)
+ rel->rd_smgr->smgr_fm_nblocks = newnblocks;
+}
+
+/*
+ * Read a frozen map page.
+ *
+ * If the page doesn't exist, InvalidBuffer is returned, or if 'extend' is
+ * true, the frozen map file is extended.
+ */
+static Buffer
+fm_readbuf(Relation rel, BlockNumber blkno, bool extend)
+{
+ Buffer buf;
+
+ /*
+ * We might not have opened the relation at the smgr level yet, or we
+ * might have been forced to close it by a sinval message. The code below
+ * won't necessarily notice relation extension immediately when extend =
+ * false, so we rely on sinval messages to ensure that our ideas about the
+ * size of the map aren't too far out of date.
+ */
+ RelationOpenSmgr(rel);
+
+ /*
+ * If we haven't cached the size of the frozen map fork yet, check it
+ * first.
+ */
+ if (rel->rd_smgr->smgr_fm_nblocks == InvalidBlockNumber)
+ {
+ if (smgrexists(rel->rd_smgr, FROZENMAP_FORKNUM))
+ rel->rd_smgr->smgr_fm_nblocks = smgrnblocks(rel->rd_smgr,
+ FROZENMAP_FORKNUM);
+ else
+ rel->rd_smgr->smgr_fm_nblocks = 0;
+ }
+
+ /* Handle requests beyond EOF */
+ if (blkno >= rel->rd_smgr->smgr_fm_nblocks)
+ {
+ if (extend)
+ fm_extend(rel, blkno + 1);
+ else
+ return InvalidBuffer;
+ }
+
+ /*
+ * Use ZERO_ON_ERROR mode, and initialize the page if necessary. It's
+ * always safe to clear bits, so it's better to clear corrupt pages than
+ * error out.
+ */
+ buf = ReadBufferExtended(rel, FROZENMAP_FORKNUM, blkno,
+ RBM_ZERO_ON_ERROR, NULL);
+ if (PageIsNew(BufferGetPage(buf)))
+ PageInit(BufferGetPage(buf), BLCKSZ, 0);
+ return buf;
+}
+
+/*
+ * Ensure that the frozen map fork is at least fm_nblocks long, extending
+ * it if necessary with zeroed pages.
+ */
+static void
+fm_extend(Relation rel, BlockNumber fm_nblocks)
+{
+ BlockNumber fm_nblocks_now;
+ Page pg;
+
+ pg = (Page) palloc(BLCKSZ);
+ PageInit(pg, BLCKSZ, 0);
+
+ /*
+ * We use the relation extension lock to lock out other backends trying to
+ * extend the frozen map at the same time. It also locks out extension
+ * of the main fork, unnecessarily, but extending the frozen map
+ * happens seldom enough that it doesn't seem worthwhile to have a
+ * separate lock tag type for it.
+ *
+ * Note that another backend might have extended or created the relation
+ * by the time we get the lock.
+ */
+ LockRelationForExtension(rel, ExclusiveLock);
+
+ /* Might have to re-open if a cache flush happened */
+ RelationOpenSmgr(rel);
+
+ /*
+ * Create the file first if it doesn't exist. If smgr_fm_nblocks is
+ * positive then it must exist, no need for an smgrexists call.
+ */
+ if ((rel->rd_smgr->smgr_fm_nblocks == 0 ||
+ rel->rd_smgr->smgr_fm_nblocks == InvalidBlockNumber) &&
+ !smgrexists(rel->rd_smgr, FROZENMAP_FORKNUM))
+ smgrcreate(rel->rd_smgr, FROZENMAP_FORKNUM, false);
+
+ fm_nblocks_now = smgrnblocks(rel->rd_smgr, FROZENMAP_FORKNUM);
+
+ /* Now extend the file */
+ while (fm_nblocks_now < fm_nblocks)
+ {
+ PageSetChecksumInplace(pg, fm_nblocks_now);
+
+ smgrextend(rel->rd_smgr, FROZENMAP_FORKNUM, fm_nblocks_now,
+ (char *) pg, false);
+ fm_nblocks_now++;
+ }
+
+ /*
+ * Send a shared-inval message to force other backends to close any smgr
+ * references they may have for this rel, which we are about to change.
+ * This is a useful optimization because it means that backends don't have
+ * to keep checking for creation or extension of the file, which happens
+ * infrequently.
+ */
+ CacheInvalidateSmgr(rel->rd_smgr->smgr_rnode);
+
+ /* Update local cache with the up-to-date size */
+ rel->rd_smgr->smgr_fm_nblocks = fm_nblocks_now;
+
+ UnlockRelationForExtension(rel, ExclusiveLock);
+
+ pfree(pg);
+}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index cb6f8a3..7f7c147 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -38,6 +38,7 @@
*/
#include "postgres.h"
+#include "access/frozenmap.h"
#include "access/heapam.h"
#include "access/heapam_xlog.h"
#include "access/hio.h"
@@ -86,7 +87,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool all_frozen_cleared, bool new_all_frozen_cleared);
static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
Bitmapset *hot_attrs,
Bitmapset *key_attrs, Bitmapset *id_attrs,
@@ -2067,8 +2069,10 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
- Buffer vmbuffer = InvalidBuffer;
+ Buffer vmbuffer = InvalidBuffer,
+ fmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ bool all_frozen_cleared = false;
/*
* Fill in tuple header fields, assign an OID, and toast the tuple if
@@ -2092,12 +2096,14 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
- &vmbuffer, NULL);
+ &vmbuffer, NULL,
+ &fmbuffer, NULL);
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
@@ -2113,6 +2119,15 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
vmbuffer);
}
+ if (PageIsAllFrozen(BufferGetPage(buffer)))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(BufferGetPage(buffer));
+ frozenmap_clear(relation,
+ ItemPointerGetBlockNumber(&(heaptup->t_self)),
+ fmbuffer);
+ }
+
/*
* XXX Should we set PageSetPrunable on this page ?
*
@@ -2157,6 +2172,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
xlrec.flags = all_visible_cleared ? XLOG_HEAP_ALL_VISIBLE_CLEARED : 0;
+ if (all_frozen_cleared)
+ xlrec.flags |= XLOG_HEAP_ALL_FROZEN_CLEARED;
Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
/*
@@ -2199,6 +2216,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
UnlockReleaseBuffer(buffer);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
+ if (fmbuffer != InvalidBuffer)
+ ReleaseBuffer(fmbuffer);
/*
* If tuple is cachable, mark it for invalidation from the caches in case
@@ -2346,8 +2365,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
while (ndone < ntuples)
{
Buffer buffer;
- Buffer vmbuffer = InvalidBuffer;
+ Buffer vmbuffer = InvalidBuffer,
+ fmbuffer = InvalidBuffer;
bool all_visible_cleared = false;
+ bool all_frozen_cleared = false;
int nthispage;
CHECK_FOR_INTERRUPTS();
@@ -2358,7 +2379,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
*/
buffer = RelationGetBufferForTuple(relation, heaptuples[ndone]->t_len,
InvalidBuffer, options, bistate,
- &vmbuffer, NULL);
+ &vmbuffer, NULL,
+ &fmbuffer, NULL);
page = BufferGetPage(buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
@@ -2395,6 +2417,15 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
vmbuffer);
}
+ if (PageIsAllFrozen(page))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(page);
+ frozenmap_clear(relation,
+ BufferGetBlockNumber(buffer),
+ fmbuffer);
+ }
+
/*
* XXX Should we set PageSetPrunable on this page ? See heap_insert()
*/
@@ -2437,6 +2468,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
tupledata = scratchptr;
xlrec->flags = all_visible_cleared ? XLOG_HEAP_ALL_VISIBLE_CLEARED : 0;
+ if (all_frozen_cleared)
+ xlrec->flags |= XLOG_HEAP_ALL_FROZEN_CLEARED;
xlrec->ntuples = nthispage;
/*
@@ -2509,6 +2542,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
UnlockReleaseBuffer(buffer);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
+ if (fmbuffer != InvalidBuffer)
+ ReleaseBuffer(fmbuffer);
ndone += nthispage;
}
@@ -3053,7 +3088,9 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
Buffer buffer,
newbuf,
vmbuffer = InvalidBuffer,
- vmbuffer_new = InvalidBuffer;
+ vmbuffer_new = InvalidBuffer,
+ fmbuffer = InvalidBuffer,
+ fmbuffer_new = InvalidBuffer;
bool need_toast,
already_marked;
Size newtupsize,
@@ -3067,6 +3104,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool key_intact;
bool all_visible_cleared = false;
bool all_visible_cleared_new = false;
+ bool all_frozen_cleared = false;
+ bool all_frozen_cleared_new = false;
bool checked_lockers;
bool locker_remains;
TransactionId xmax_new_tuple,
@@ -3100,14 +3139,17 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
page = BufferGetPage(buffer);
/*
- * Before locking the buffer, pin the visibility map page if it appears to
- * be necessary. Since we haven't got the lock yet, someone else might be
- * in the middle of changing this, so we'll need to recheck after we have
- * the lock.
+ * Before locking the buffer, pin the visibility map and frozen map page
+ * if it appears to be necessary. Since we haven't got the lock yet,
+ * someone else might be in the middle of changing this, so we'll need to
+ * recheck after we have the lock.
*/
if (PageIsAllVisible(page))
visibilitymap_pin(relation, block, &vmbuffer);
+ if (PageIsAllFrozen(page))
+ frozenmap_pin(relation, block, &fmbuffer);
+
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
lp = PageGetItemId(page, ItemPointerGetOffsetNumber(otid));
@@ -3390,19 +3432,21 @@ l2:
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
+ if (fmbuffer != InvalidBuffer)
+ ReleaseBuffer(fmbuffer);
bms_free(hot_attrs);
bms_free(key_attrs);
return result;
}
/*
- * If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, or during some
- * subsequent window during which we had it unlocked, we'll have to unlock
- * and re-lock, to avoid holding the buffer lock across an I/O. That's a
- * bit unfortunate, especially since we'll now have to recheck whether the
- * tuple has been locked or updated under us, but hopefully it won't
- * happen very often.
+ * If we didn't pin the visibility(and frozen) map page and the page has
+ * become all visible(and frozen) while we were busy locking the buffer,
+ * or during some subsequent window during which we had it unlocked,
+ * we'll have to unlock and re-lock, to avoid holding the buffer lock
+ * across an I/O. That's a bit unfortunate, especially since we'll now
+ * have to recheck whether the tuple has been locked or updated under us,
+ * but hopefully it won't happen very often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -3412,6 +3456,15 @@ l2:
goto l2;
}
+ if (fmbuffer == InvalidBuffer && PageIsAllFrozen(page))
+ {
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ frozenmap_pin(relation, block, &fmbuffer);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ goto l2;
+
+ }
+
/*
* We're about to do the actual update -- check for conflict first, to
* avoid possibly having to roll back work we've just done.
@@ -3570,7 +3623,8 @@ l2:
/* Assume there's no chance to put heaptup on same page. */
newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
buffer, 0, NULL,
- &vmbuffer_new, &vmbuffer);
+ &vmbuffer_new, &vmbuffer,
+ &fmbuffer_new, &fmbuffer);
}
else
{
@@ -3588,7 +3642,8 @@ l2:
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
buffer, 0, NULL,
- &vmbuffer_new, &vmbuffer);
+ &vmbuffer_new, &vmbuffer,
+ &fmbuffer_new, &fmbuffer);
}
else
{
@@ -3713,6 +3768,22 @@ l2:
vmbuffer_new);
}
+ /* clear PD_ALL_FROZEN flags */
+ if (newbuf == buffer && PageIsAllFrozen(BufferGetPage(buffer)))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(BufferGetPage(buffer));
+ frozenmap_clear(relation, BufferGetBlockNumber(buffer),
+ fmbuffer);
+ }
+ else if (newbuf != buffer && PageIsAllFrozen(BufferGetPage(newbuf)))
+ {
+ all_frozen_cleared_new = true;
+ PageClearAllFrozen(BufferGetPage(newbuf));
+ frozenmap_clear(relation, BufferGetBlockNumber(newbuf),
+ fmbuffer_new);
+ }
+
if (newbuf != buffer)
MarkBufferDirty(newbuf);
MarkBufferDirty(buffer);
@@ -3736,7 +3807,9 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ all_frozen_cleared,
+ all_frozen_cleared_new);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -3768,6 +3841,10 @@ l2:
ReleaseBuffer(vmbuffer_new);
if (BufferIsValid(vmbuffer))
ReleaseBuffer(vmbuffer);
+ if (BufferIsValid(fmbuffer_new))
+ ReleaseBuffer(fmbuffer_new);
+ if (BufferIsValid(fmbuffer))
+ ReleaseBuffer(fmbuffer);
/*
* Release the lmgr tuple lock, if we had it.
@@ -6534,6 +6611,34 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
}
/*
+ * Perform XLogInsert for a heap-all-frozen operation. heap_buffer is the block
+ * being marked all-frozen, and fm_buffer is the buffer containing the
+ * corresponding frozen map block. Both should have already been modified and dirty.
+ */
+XLogRecPtr
+log_heap_frozenmap(RelFileNode rnode, Buffer heap_buffer, Buffer fm_buffer)
+{
+ XLogRecPtr recptr;
+ uint8 flags;
+
+ Assert(BufferIsValid(heap_buffer));
+ Assert(BufferIsValid(fm_buffer));
+
+ XLogBeginInsert();
+
+ XLogRegisterBuffer(0, fm_buffer, 0);
+
+ flags = REGBUF_STANDARD;
+ if (!XLogHintBitIsNeeded())
+ flags |= REGBUF_NO_IMAGE;
+ XLogRegisterBuffer(1, heap_buffer, flags);
+
+ recptr = XLogInsert(RM_HEAP3_ID, XLOG_HEAP3_FROZENMAP);
+
+ return recptr;
+}
+
+/*
* Perform XLogInsert for a heap-visible operation. 'block' is the block
* being marked all-visible, and vm_buffer is the buffer containing the
* corresponding visibility map block. Both should have already been modified
@@ -6577,7 +6682,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool all_frozen_cleared, bool new_all_frozen_cleared)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -6660,6 +6766,10 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.flags |= XLOG_HEAP_ALL_VISIBLE_CLEARED;
if (new_all_visible_cleared)
xlrec.flags |= XLOG_HEAP_NEW_ALL_VISIBLE_CLEARED;
+ if (all_frozen_cleared)
+ xlrec.flags |= XLOG_HEAP_ALL_FROZEN_CLEARED;
+ if (new_all_frozen_cleared)
+ xlrec.flags |= XLOG_HEAP_NEW_ALL_FROZEN_CLEARED;
if (prefixlen > 0)
xlrec.flags |= XLOG_HEAP_PREFIX_FROM_OLD;
if (suffixlen > 0)
@@ -7198,6 +7308,75 @@ heap_xlog_visible(XLogReaderState *record)
UnlockReleaseBuffer(vmbuffer);
}
+
+/*
+ * Replay XLOG_HEAP3_FROZENMAP record.
+ */
+static void
+heap_xlog_frozenmap(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ Buffer fmbuffer = InvalidBuffer;
+ Buffer buffer;
+ Page page;
+ RelFileNode rnode;
+ BlockNumber blkno;
+ XLogRedoAction action;
+
+ XLogRecGetBlockTag(record, 1, &rnode, NULL, &blkno);
+
+ /*
+ * Read the heap page, if it still exists. If the heap file has dropped or
+ * truncated later in recovery, we don't need to update the page, but we'd
+ * better still update the frozen map.
+ */
+ action = XLogReadBufferForRedo(record, 1, &buffer);
+ if (action == BLK_NEEDS_REDO)
+ {
+ page = BufferGetPage(buffer);
+ PageSetAllFrozen(page);
+ MarkBufferDirty(buffer);
+ }
+ else if (action == BLK_RESTORED)
+ {
+ /*
+ * If heap block was backed up, restore it. This can only happen with
+ * checksums enabled.
+ */
+ Assert(DataChecksumsEnabled());
+ }
+ if (BufferIsValid(buffer))
+ UnlockReleaseBuffer(buffer);
+
+ if (XLogReadBufferForRedoExtended(record, 0, RBM_ZERO_ON_ERROR, false,
+ &fmbuffer) == BLK_NEEDS_REDO)
+ {
+ Page fmpage = BufferGetPage(fmbuffer);
+ Relation reln;
+
+ /* initialize the page if it was read as zeros */
+ if (PageIsNew(fmpage))
+ PageInit(fmpage, BLCKSZ, 0);
+
+ /*
+ * XLogReplayBufferExtended locked the buffer. But frozenmap_set
+ * will handle locking itself.
+ */
+ LockBuffer(fmbuffer, BUFFER_LOCK_UNLOCK);
+
+ reln = CreateFakeRelcacheEntry(rnode);
+ frozenmap_pin(reln, blkno, &fmbuffer);
+
+ if (lsn > PageGetLSN(fmpage))
+ frozenmap_set(reln, blkno, InvalidBuffer, lsn, fmbuffer);
+
+ ReleaseBuffer(fmbuffer);
+ FreeFakeRelcacheEntry(reln);
+ }
+ else if (BufferIsValid(fmbuffer))
+ UnlockReleaseBuffer(fmbuffer);
+}
+
/*
* Replay XLOG_HEAP2_FREEZE_PAGE records
*/
@@ -7384,6 +7563,20 @@ heap_xlog_insert(XLogReaderState *record)
FreeFakeRelcacheEntry(reln);
}
+ /* The frozen map may need to be fixed even if the heap page is
+ * already up-to-date.
+ */
+ if (xlrec->flags & XLOG_HEAP_ALL_FROZEN_CLEARED)
+ {
+ Relation reln = CreateFakeRelcacheEntry(target_node);
+ Buffer fmbuffer = InvalidBuffer;
+
+ frozenmap_pin(reln, blkno, &fmbuffer);
+ frozenmap_clear(reln, blkno, fmbuffer);
+ ReleaseBuffer(fmbuffer);
+ FreeFakeRelcacheEntry(reln);
+ }
+
/*
* If we inserted the first and only tuple on the page, re-initialize the
* page from scratch.
@@ -7439,6 +7632,9 @@ heap_xlog_insert(XLogReaderState *record)
if (xlrec->flags & XLOG_HEAP_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLOG_HEAP_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
+
MarkBufferDirty(buffer);
}
if (BufferIsValid(buffer))
@@ -7504,6 +7700,21 @@ heap_xlog_multi_insert(XLogReaderState *record)
FreeFakeRelcacheEntry(reln);
}
+ /*
+ * The frozen map may need to be fixed even if the heap page is
+ * already up-to-date.
+ */
+ if (xlrec->flags & XLOG_HEAP_ALL_FROZEN_CLEARED)
+ {
+ Relation reln = CreateFakeRelcacheEntry(rnode);
+ Buffer fmbuffer = InvalidBuffer;
+
+ frozenmap_pin(reln, blkno, &fmbuffer);
+ frozenmap_clear(reln, blkno, fmbuffer);
+ ReleaseBuffer(fmbuffer);
+ FreeFakeRelcacheEntry(reln);
+ }
+
if (isinit)
{
buffer = XLogInitBufferForRedo(record, 0);
@@ -7577,6 +7788,8 @@ heap_xlog_multi_insert(XLogReaderState *record)
if (xlrec->flags & XLOG_HEAP_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLOG_HEAP_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
MarkBufferDirty(buffer);
}
@@ -7660,6 +7873,22 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
}
/*
+ * The frozen map may need to be fixed even if the heap page is
+ * already up-to-date.
+ */
+ if (xlrec->flags & XLOG_HEAP_ALL_FROZEN_CLEARED)
+ {
+ Relation reln = CreateFakeRelcacheEntry(rnode);
+ Buffer fmbuffer = InvalidBuffer;
+
+ frozenmap_pin(reln, oldblk, &fmbuffer);
+ frozenmap_clear(reln, oldblk, fmbuffer);
+ ReleaseBuffer(fmbuffer);
+ FreeFakeRelcacheEntry(reln);
+ }
+
+
+ /*
* In normal operation, it is important to lock the two pages in
* page-number order, to avoid possible deadlocks against other update
* operations going the other way. However, during WAL replay there can
@@ -7705,6 +7934,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
if (xlrec->flags & XLOG_HEAP_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLOG_HEAP_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
PageSetLSN(page, lsn);
MarkBufferDirty(obuffer);
@@ -7743,6 +7974,21 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
FreeFakeRelcacheEntry(reln);
}
+ /*
+ * The frozen map may need to be fixed even if the heap page is
+ * already up-to-date.
+ */
+ if (xlrec->flags & XLOG_HEAP_ALL_FROZEN_CLEARED)
+ {
+ Relation reln = CreateFakeRelcacheEntry(rnode);
+ Buffer fmbuffer = InvalidBuffer;
+
+ frozenmap_pin(reln, oldblk, &fmbuffer);
+ frozenmap_clear(reln, oldblk, fmbuffer);
+ ReleaseBuffer(fmbuffer);
+ FreeFakeRelcacheEntry(reln);
+ }
+
/* Deal with new tuple */
if (newaction == BLK_NEEDS_REDO)
{
@@ -7840,6 +8086,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
if (xlrec->flags & XLOG_HEAP_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLOG_HEAP_NEW_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
@@ -8072,6 +8320,21 @@ heap2_redo(XLogReaderState *record)
}
}
+void
+heap3_redo(XLogReaderState *record)
+{
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ switch (info & XLOG_HEAP_OPMASK)
+ {
+ case XLOG_HEAP3_FROZENMAP:
+ heap_xlog_frozenmap(record);
+ break;
+ default:
+ elog(PANIC, "heap3_redo: unknown op code %u", info);
+ }
+}
+
/*
* heap_sync - sync a heap, for use when no WAL has been written
*
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6d091f6..5460d4f 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -15,6 +15,7 @@
#include "postgres.h"
+#include "access/frozenmap.h"
#include "access/heapam.h"
#include "access/hio.h"
#include "access/htup_details.h"
@@ -156,6 +157,62 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
}
/*
+ * For each heap page which is all-frozen, acquire a pin on the appropriate
+ * frozen map page, if we haven't already got one.
+ *
+ * This function is same logic as GetVisibilityMapPins function.
+ */
+static void
+GetFrozenMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
+ BlockNumber block1, BlockNumber block2,
+ Buffer *fmbuffer1, Buffer *fmbuffer2)
+{
+ bool need_to_pin_buffer1;
+ bool need_to_pin_buffer2;
+
+ Assert(BufferIsValid(buffer1));
+ Assert(buffer2 == InvalidBuffer || buffer1 <= buffer2);
+
+ while (1)
+ {
+ /* Figure out which pins we need but don't have. */
+ need_to_pin_buffer1 = PageIsAllFrozen(BufferGetPage(buffer1))
+ && !frozenmap_pin_ok(block1, *fmbuffer1);
+ need_to_pin_buffer2 = buffer2 != InvalidBuffer
+ && PageIsAllFrozen(BufferGetPage(buffer2))
+ && !frozenmap_pin_ok(block2, *fmbuffer2);
+ if (!need_to_pin_buffer1 && !need_to_pin_buffer2)
+ return;
+
+ /* We must unlock both buffers before doing any I/O. */
+ LockBuffer(buffer1, BUFFER_LOCK_UNLOCK);
+ if (buffer2 != InvalidBuffer && buffer2 != buffer1)
+ LockBuffer(buffer2, BUFFER_LOCK_UNLOCK);
+
+ /* Get pins. */
+ if (need_to_pin_buffer1)
+ frozenmap_pin(relation, block1, fmbuffer1);
+ if (need_to_pin_buffer2)
+ frozenmap_pin(relation, block2, fmbuffer2);
+
+ /* Relock buffers. */
+ LockBuffer(buffer1, BUFFER_LOCK_EXCLUSIVE);
+ if (buffer2 != InvalidBuffer && buffer2 != buffer1)
+ LockBuffer(buffer2, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * If there are two buffers involved and we pinned just one of them,
+ * it's possible that the second one became all-frozen while we were
+ * busy pinning the first one. If it looks like that's a possible
+ * scenario, we'll need to make a second pass through this loop.
+ */
+ if (buffer2 == InvalidBuffer || buffer1 == buffer2
+ || (need_to_pin_buffer1 && need_to_pin_buffer2))
+ break;
+ }
+}
+
+/*
* RelationGetBufferForTuple
*
* Returns pinned and exclusive-locked buffer of a page in given relation
@@ -215,7 +272,8 @@ Buffer
RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other)
+ Buffer *vmbuffer, Buffer *vmbuffer_other,
+ Buffer *fmbuffer, Buffer *fmbuffer_other)
{
bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
Buffer buffer = InvalidBuffer;
@@ -316,6 +374,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
buffer = ReadBufferBI(relation, targetBlock, bistate);
if (PageIsAllVisible(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
+ if (PageIsAllFrozen(BufferGetPage(buffer)))
+ frozenmap_pin(relation, targetBlock, fmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
else if (otherBlock == targetBlock)
@@ -324,6 +384,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
buffer = otherBuffer;
if (PageIsAllVisible(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
+ if (PageIsAllFrozen(BufferGetPage(buffer)))
+ frozenmap_pin(relation, targetBlock, fmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
else if (otherBlock < targetBlock)
@@ -332,6 +394,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
buffer = ReadBuffer(relation, targetBlock);
if (PageIsAllVisible(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
+ if (PageIsAllFrozen(BufferGetPage(buffer)))
+ frozenmap_pin(relation, targetBlock, fmbuffer);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
@@ -341,6 +405,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
buffer = ReadBuffer(relation, targetBlock);
if (PageIsAllVisible(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
+ if (PageIsAllFrozen(BufferGetPage(buffer)))
+ frozenmap_pin(relation, targetBlock, fmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
}
@@ -367,13 +433,23 @@ RelationGetBufferForTuple(Relation relation, Size len,
* done.
*/
if (otherBuffer == InvalidBuffer || buffer <= otherBuffer)
+ {
GetVisibilityMapPins(relation, buffer, otherBuffer,
targetBlock, otherBlock, vmbuffer,
vmbuffer_other);
+ GetFrozenMapPins(relation, buffer, otherBuffer,
+ targetBlock, otherBlock, fmbuffer,
+ fmbuffer_other);
+ }
else
+ {
GetVisibilityMapPins(relation, otherBuffer, buffer,
otherBlock, targetBlock, vmbuffer_other,
vmbuffer);
+ GetFrozenMapPins(relation, otherBuffer, buffer,
+ otherBlock, targetBlock, fmbuffer_other,
+ fmbuffer);
+ }
/*
* Now we can check to see if there's enough free space here. If so,
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index 4f06a26..9a67733 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -149,6 +149,20 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
}
}
+void
+heap3_desc(StringInfo buf, XLogReaderState *record)
+{
+ char *rec = XLogRecGetData(record);
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_HEAP3_FROZENMAP)
+ {
+ xl_heap_clean *xlrec = (xl_heap_clean *) rec;
+
+ appendStringInfo(buf, "remxid %u", xlrec->latestRemovedXid);
+ }
+}
+
const char *
heap_identify(uint8 info)
{
@@ -226,3 +240,18 @@ heap2_identify(uint8 info)
return id;
}
+
+const char *
+heap3_identify(uint8 info)
+{
+ const char *id = NULL;
+
+ switch (info & ~XLR_INFO_MASK)
+ {
+ case XLOG_HEAP3_FROZENMAP:
+ id = "FROZENMAP";
+ break;
+ }
+
+ return id;
+}
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index ce398fc..961775e 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/frozenmap.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
@@ -228,6 +229,7 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
{
bool fsm;
bool vm;
+ bool fm;
/* Open it at the smgr level if not already done */
RelationOpenSmgr(rel);
@@ -238,6 +240,7 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
rel->rd_smgr->smgr_targblock = InvalidBlockNumber;
rel->rd_smgr->smgr_fsm_nblocks = InvalidBlockNumber;
rel->rd_smgr->smgr_vm_nblocks = InvalidBlockNumber;
+ rel->rd_smgr->smgr_fm_nblocks = InvalidBlockNumber;
/* Truncate the FSM first if it exists */
fsm = smgrexists(rel->rd_smgr, FSM_FORKNUM);
@@ -249,6 +252,11 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
if (vm)
visibilitymap_truncate(rel, nblocks);
+ /* Truncate the frozen map too if it exists. */
+ fm = smgrexists(rel->rd_smgr, FROZENMAP_FORKNUM);
+ if (fm)
+ frozenmap_truncate(rel, nblocks);
+
/*
* We WAL-log the truncation before actually truncating, which means
* trouble if the truncation fails. If we then crash, the WAL replay
@@ -282,7 +290,7 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
* with a truncated heap, but the FSM or visibility map would still
* contain entries for the non-existent heap pages.
*/
- if (fsm || vm)
+ if (fsm || vm || fm)
XLogFlush(lsn);
}
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 3febdd5..80a9f96 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -17,6 +17,7 @@
*/
#include "postgres.h"
+#include "access/frozenmap.h"
#include "access/multixact.h"
#include "access/relscan.h"
#include "access/rewriteheap.h"
@@ -1484,6 +1485,10 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
Oid mapped_tables[4];
int reindex_flags;
int i;
+ Buffer fmbuffer = InvalidBuffer,
+ buf = InvalidBuffer;
+ Relation rel;
+ BlockNumber nblocks, blkno;
/* Zero out possible results from swapped_relation_files */
memset(mapped_tables, 0, sizeof(mapped_tables));
@@ -1591,6 +1596,26 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
RelationMapRemoveMapping(mapped_tables[i]);
/*
+ * We can ensure that all tuples of the new relation have been completely
+ * frozen at this point, since we already acquired AccessExclusiveLock.
+ * Set the frozen map bit and the page header flag for each page.
+ */
+ rel = relation_open(OIDOldHeap, NoLock);
+ nblocks = RelationGetNumberOfBlocks(rel);
+ for (blkno = 0; blkno < nblocks; blkno++)
+ {
+ buf = ReadBuffer(rel, blkno);
+ PageSetAllFrozen(BufferGetPage(buf));
+ frozenmap_pin(rel, blkno, &fmbuffer);
+ frozenmap_set(rel, blkno, buf, InvalidXLogRecPtr, fmbuffer);
+ ReleaseBuffer(buf);
+ }
+
+ if (fmbuffer != InvalidBuffer)
+ ReleaseBuffer(fmbuffer);
+ relation_close(rel, NoLock);
+
+ /*
* At this point, everything is kosher except that, if we did toast swap
* by links, the toast table's name corresponds to the transient table.
* The name is irrelevant to the backend because it's referenced by OID,
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index c3d6e59..8e9940b 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -37,6 +37,7 @@
#include <math.h>
+#include "access/frozenmap.h"
#include "access/genam.h"
#include "access/heapam.h"
#include "access/heapam_xlog.h"
@@ -106,6 +107,7 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber fmskipped_pages; /* # of pages we skipped by frozen map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -222,6 +224,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we may still skip scanning some pages
+ * according to the frozen map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -247,20 +251,22 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
vac_close_indexes(nindexes, Irel, NoLock);
/*
- * Compute whether we actually scanned the whole relation. If we did, we
- * can adjust relfrozenxid and relminmxid.
+ * Compute whether we actually scanned the whole relation. If we did,
+ * we can adjust relfrozenxid and relminmxid.
*
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->fmskipped_pages)
+ < vacrelstats->rel_pages)
{
- Assert(!scan_all);
scanned_all = false;
}
else
scanned_all = true;
+ scanned_all |= scan_all;
+
/*
* Optionally truncate the relation.
*
@@ -450,7 +456,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
IndexBulkDeleteResult **indstats;
int i;
PGRUsage ru0;
- Buffer vmbuffer = InvalidBuffer;
+ Buffer vmbuffer = InvalidBuffer,
+ fmbuffer = InvalidBuffer;
BlockNumber next_not_all_visible_block;
bool skipping_all_visible_blocks;
xl_heap_freeze_tuple *frozen;
@@ -533,6 +540,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
hastup;
int prev_dead_count;
int nfrozen;
+ int already_nfrozen; /* # of tuples already frozen */
+ int ntup_blk; /* # of tuples in single page */
Size freespace;
bool all_visible_according_to_vm;
bool all_visible;
@@ -562,12 +571,33 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
skipping_all_visible_blocks = false;
all_visible_according_to_vm = false;
+
+ /* Even if the current block is not all-visible, we can skip
+ * vacuuming this block when the corresponding frozen map bit is
+ * set and a whole-table scan is required.
+ */
+ if (frozenmap_test(onerel, blkno, &fmbuffer) && scan_all)
+ {
+ vacrelstats->fmskipped_pages++;
+ continue;
+ }
}
else
{
- /* Current block is all-visible */
+ /*
+ * Current block is all-visible.
+ * If frozen map represents that it's all frozen and this
+ * function is called for freezing tuples, we can skip to
+ * vacuum block.
+ */
+ if (frozenmap_test(onerel, blkno, &fmbuffer) && scan_all)
+ {
+ vacrelstats->fmskipped_pages++;
+ continue;
+ }
if (skipping_all_visible_blocks && !scan_all)
continue;
+
all_visible_according_to_vm = true;
}
@@ -592,6 +622,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
vmbuffer = InvalidBuffer;
}
+ if (BufferIsValid(fmbuffer))
+ {
+ ReleaseBuffer(fmbuffer);
+ fmbuffer = InvalidBuffer;
+ }
+
/* Log cleanup info before we touch indexes */
vacuum_log_cleanup_info(onerel, vacrelstats);
@@ -621,6 +657,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* and did a cycle of index vacuuming.
*/
visibilitymap_pin(onerel, blkno, &vmbuffer);
+ frozenmap_pin(onerel, blkno, &fmbuffer);
buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vac_strategy);
@@ -763,6 +800,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ already_nfrozen = 0;
+ ntup_blk = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -917,8 +956,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_blk += 1;
hastup = true;
+ /* If current tuple is already frozen, count it up */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ already_nfrozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -952,6 +996,27 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
heap_execute_freeze_tuple(htup, &frozen[i]);
}
+ /*
+ * If the un-frozen tuple is remaining in current page and
+ * current page is marked as ALL_FROZEN, we should clear it.
+ */
+ if (ntup_blk != (nfrozen + already_nfrozen)
+ && PageIsAllFrozen(page))
+ {
+ PageClearAllFrozen(page);
+ frozenmap_clear(onerel, blkno, fmbuffer);
+ }
+ /*
+ * As a result of scanning this page, we know all tuples on it are
+ * completely frozen, so set the frozen map bit and the
+ * PD_ALL_FROZEN flag on the page.
+ */
+ else if (ntup_blk == (nfrozen + already_nfrozen))
+ {
+ PageSetAllFrozen(page);
+ frozenmap_set(onerel, blkno, buf, InvalidXLogRecPtr, fmbuffer);
+ }
+
+ /* Now WAL-log freezing if necessary */
if (RelationNeedsWAL(onerel))
{
@@ -1077,13 +1142,18 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map and frozen map page.
*/
if (BufferIsValid(vmbuffer))
{
ReleaseBuffer(vmbuffer);
vmbuffer = InvalidBuffer;
}
+ if (BufferIsValid(fmbuffer))
+ {
+ ReleaseBuffer(fmbuffer);
+ fmbuffer = InvalidBuffer;
+ }
/* If any tuples need to be deleted, perform final vacuum cycle */
/* XXX put a threshold on min number of tuples here? */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index f96fb24..67898df 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -92,7 +92,7 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList)
if (exprType((Node *) tle->expr) != attr->atttypid)
ereport(ERROR,
(errcode(ERRCODE_DATATYPE_MISMATCH),
- errmsg("table row type and query-specified row type do not match"),
+ errmsg("table row type and query-specified row type do not match"),
errdetail("Table has type %s at ordinal position %d, but query expects %s.",
format_type_be(attr->atttypid),
attno,
@@ -117,7 +117,7 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList)
if (attno != resultDesc->natts)
ereport(ERROR,
(errcode(ERRCODE_DATATYPE_MISMATCH),
- errmsg("table row type and query-specified row type do not match"),
+ errmsg("table row type and query-specified row type do not match"),
errdetail("Query has too few columns.")));
}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index eb7293f..d66660d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -55,6 +55,7 @@ typedef struct XLogRecordBuffer
static void DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
static void DecodeHeapOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
static void DecodeHeap2Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
+static void DecodeHeap3Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
static void DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
static void DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
@@ -104,6 +105,10 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
DecodeStandbyOp(ctx, &buf);
break;
+ case RM_HEAP3_ID:
+ DecodeHeap3Op(ctx, &buf);
+ break;
+
case RM_HEAP2_ID:
DecodeHeap2Op(ctx, &buf);
break;
@@ -300,6 +305,29 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
}
/*
+ * Handle rmgr HEAP3_ID records for DecodeRecordIntoReorderBuffer().
+ */
+static void
+DecodeHeap3Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
+{
+ uint8 info = XLogRecGetInfo(buf->record) & XLOG_HEAP_OPMASK;
+ SnapBuild *builder = ctx->snapshot_builder;
+
+ /* no point in doing anything yet */
+ if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
+ return;
+
+ switch (info)
+ {
+ case XLOG_HEAP3_FROZENMAP:
+ break;
+ default:
+ elog(ERROR, "unexpected RM_HEAP3_ID record type: %u", info);
+ }
+
+}
+
+/*
* Handle rmgr HEAP2_ID records for DecodeRecordIntoReorderBuffer().
*/
static void
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..666e682 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -168,6 +168,7 @@ smgropen(RelFileNode rnode, BackendId backend)
reln->smgr_targblock = InvalidBlockNumber;
reln->smgr_fsm_nblocks = InvalidBlockNumber;
reln->smgr_vm_nblocks = InvalidBlockNumber;
+ reln->smgr_fm_nblocks = InvalidBlockNumber;
reln->smgr_which = 0; /* we only have md.c at present */
/* mark it not open */
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 66dfef1..7eba9ee 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -35,6 +35,7 @@ const char *const forkNames[] = {
"main", /* MAIN_FORKNUM */
"fsm", /* FSM_FORKNUM */
"vm", /* VISIBILITYMAP_FORKNUM */
+ "fm", /* FROZENMAP_FORKNUM */
"init" /* INIT_FORKNUM */
};
@@ -58,7 +59,7 @@ forkname_to_number(const char *forkName)
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("invalid fork name"),
errhint("Valid fork names are \"main\", \"fsm\", "
- "\"vm\", and \"init\".")));
+ "\"vm\", \"fm\" and \"init\".")));
#endif
return InvalidForkNumber;
diff --git a/src/include/access/frozenmap.h b/src/include/access/frozenmap.h
new file mode 100644
index 0000000..0f2e54e
--- /dev/null
+++ b/src/include/access/frozenmap.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * frozenmap.h
+ * frozen map interface
+ *
+ *
+ * Portions Copyright (c) 2007-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/frozenmap.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef FROZENMAP_H
+#define FROZENMAP_H
+
+#include "access/xlogdefs.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "utils/relcache.h"
+
+extern void frozenmap_clear(Relation rel, BlockNumber heapBlk,
+ Buffer fmbuf);
+extern void frozenmap_pin(Relation rel, BlockNumber heapBlk,
+ Buffer *fmbuf);
+extern bool frozenmap_pin_ok(BlockNumber heapBlk, Buffer fmbuf);
+extern void frozenmap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
+ XLogRecPtr recptr, Buffer fmBuf);
+extern bool frozenmap_test(Relation rel, BlockNumber heapBlk, Buffer *fmbuf);
+extern BlockNumber frozenmap_count(Relation rel);
+extern void frozenmap_truncate(Relation rel, BlockNumber nheapblocks);
+
+#endif /* FROZENMAP_H */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index f0f89de..087cfeb 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -60,6 +60,13 @@
#define XLOG_HEAP2_NEW_CID 0x70
/*
+ * heapam.c has a third RmgrId now. These opcodes are associated with
+ * RM_HEAP3_ID, but are not logically different from the ones above
+ * associated with RM_HEAP_ID. XLOG_HEAP_OPMASK applies to these, too.
+ */
+#define XLOG_HEAP3_FROZENMAP 0x00
+
+/*
* xl_heap_* ->flag values, 8 bits are available.
*/
/* PD_ALL_VISIBLE was cleared */
@@ -73,6 +80,10 @@
#define XLOG_HEAP_SUFFIX_FROM_OLD (1<<6)
/* last xl_heap_multi_insert record for one heap_multi_insert() call */
#define XLOG_HEAP_LAST_MULTI_INSERT (1<<7)
+/* PD_ALL_FROZEN was cleared by an INSERT or UPDATE */
+#define XLOG_HEAP_ALL_FROZEN_CLEARED (1<<8)
+/* same, but for the page of the new tuple during an UPDATE */
+#define XLOG_HEAP_NEW_ALL_FROZEN_CLEARED (1<<9)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLOG_HEAP_CONTAINS_OLD \
@@ -110,12 +121,12 @@ typedef struct xl_heap_header
typedef struct xl_heap_insert
{
OffsetNumber offnum; /* inserted tuple's offset */
- uint8 flags;
+ uint16 flags;
/* xl_heap_header & TUPLE DATA in backup block 0 */
} xl_heap_insert;
-#define SizeOfHeapInsert (offsetof(xl_heap_insert, flags) + sizeof(uint8))
+#define SizeOfHeapInsert (offsetof(xl_heap_insert, flags) + sizeof(uint16))
/*
* This is what we need to know about a multi-insert.
@@ -130,7 +141,7 @@ typedef struct xl_heap_insert
*/
typedef struct xl_heap_multi_insert
{
- uint8 flags;
+ uint16 flags;
uint16 ntuples;
OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
} xl_heap_multi_insert;
@@ -170,7 +181,7 @@ typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
OffsetNumber old_offnum; /* old tuple's offset */
uint8 old_infobits_set; /* infomask bits to set on old tuple */
- uint8 flags;
+ uint16 flags;
TransactionId new_xmax; /* xmax of the new tuple */
OffsetNumber new_offnum; /* new tuple's offset */
@@ -342,6 +353,9 @@ extern const char *heap_identify(uint8 info);
extern void heap2_redo(XLogReaderState *record);
extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
+extern void heap3_redo(XLogReaderState *record);
+extern void heap3_desc(StringInfo buf, XLogReaderState *record);
+extern const char *heap3_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
@@ -354,6 +368,8 @@ extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
int ntuples);
+extern XLogRecPtr log_heap_frozenmap(RelFileNode rnode, Buffer heap_buffer,
+ Buffer fm_buffer);
extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index b014029..1a27ee8 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -40,6 +40,8 @@ extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other);
+ Buffer *vmbuffer, Buffer *vmbuffer_other,
+ Buffer *fmbuffer, Buffer *fmbuffer_other
+ );
#endif /* HIO_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 48f04c6..e49c0b0 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -34,6 +34,7 @@ PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, N
PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL)
PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL)
PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL)
+PG_RMGR(RM_HEAP3_ID, "Heap3", heap3_redo, heap3_desc, heap3_identify, NULL, NULL)
PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL)
PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL)
PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL)
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 8b4c35c..8420e47 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -95,7 +97,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 30
+#define Natts_pg_class 31
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -107,25 +109,26 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relispopulated 25
-#define Anum_pg_class_relreplident 26
-#define Anum_pg_class_relfrozenxid 27
-#define Anum_pg_class_relminmxid 28
-#define Anum_pg_class_relacl 29
-#define Anum_pg_class_reloptions 30
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relispopulated 26
+#define Anum_pg_class_relreplident 27
+#define Anum_pg_class_relfrozenxid 28
+#define Anum_pg_class_relminmxid 29
+#define Anum_pg_class_relacl 30
+#define Anum_pg_class_reloptions 31
/* ----------------
* initial contents of pg_class
@@ -140,13 +143,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 27 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 27 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a263779..5d40997 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -27,6 +27,7 @@ typedef enum ForkNumber
MAIN_FORKNUM = 0,
FSM_FORKNUM,
VISIBILITYMAP_FORKNUM,
+ FROZENMAP_FORKNUM,
INIT_FORKNUM
/*
@@ -38,7 +39,7 @@ typedef enum ForkNumber
#define MAX_FORKNUM INIT_FORKNUM
-#define FORKNAMECHARS 4 /* max chars for a fork name */
+#define FORKNAMECHARS 5 /* max chars for a fork name */
extern const char *const forkNames[];
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index c2fbffc..f46375d 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -367,6 +369,13 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+#define PageClearAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags &= ~PD_ALL_FROZEN)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..2173c20 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -55,6 +55,7 @@ typedef struct SMgrRelationData
BlockNumber smgr_targblock; /* current insertion target block */
BlockNumber smgr_fsm_nblocks; /* last known size of fsm fork */
BlockNumber smgr_vm_nblocks; /* last known size of vm fork */
+ BlockNumber smgr_fm_nblocks; /* last known size of fm fork */
/* additional public fields may someday exist here */
On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote:
Attached WIP patch adds Frozen Map which enables us to avoid whole
table vacuuming even when full scan is required: preventing XID
wraparound failures.
Frozen Map is a bitmap with one bit per heap page, and quite similar
to Visibility Map. A set bit means that all tuples on heap page are
completely frozen, therefore we don't need to do vacuum freeze that
page.
A bit is set when vacuum(or autovacuum) figures out that all tuples on
corresponding heap page are completely frozen, and a bit is cleared
when INSERT and UPDATE(only new heap page) are executed.
So, this patch avoids reading the all-frozen pages if it has not been
modified since the last VACUUM FREEZE? Since it is already frozen, the
running VACUUM FREEZE will not modify the page or generate WAL, so is it
really worth maintaining a new per-page bitmap just to avoid the
sequential scan of tables every 200MB transactions? I would like to see
us reduce the need for VACUUM FREEZE, rather than go in this direction.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 4/20/15 1:48 PM, Bruce Momjian wrote:
On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote:
Attached WIP patch adds Frozen Map which enables us to avoid whole
table vacuuming even when full scan is required: preventing XID
wraparound failures.
Frozen Map is a bitmap with one bit per heap page, and quite similar
to Visibility Map. A set bit means that all tuples on heap page are
completely frozen, therefore we don't need to do vacuum freeze that
page.
A bit is set when vacuum(or autovacuum) figures out that all tuples on
corresponding heap page are completely frozen, and a bit is cleared
when INSERT and UPDATE(only new heap page) are executed.
So, this patch avoids reading the all-frozen pages if it has not been
modified since the last VACUUM FREEZE? Since it is already frozen, the
running VACUUM FREEZE will not modify the page or generate WAL, so is it
really worth maintaining a new per-page bitmap just to avoid the
sequential scan of tables every 200MB transactions? I would like to see
us reduce the need for VACUUM FREEZE, rather than go in this direction.
How would you propose we do that?
I also think there's better ways we could handle *all* our cleanup work.
Tuples have a definite lifespan, and there's potentially a lot of
efficiency to be gained if we could track tuples through their stages of
life... but I don't see any easy ways to do that.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Mon, Apr 20, 2015 at 01:59:17PM -0500, Jim Nasby wrote:
On 4/20/15 1:48 PM, Bruce Momjian wrote:
On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote:
Attached WIP patch adds Frozen Map which enables us to avoid whole
table vacuuming even when full scan is required: preventing XID
wraparound failures.
Frozen Map is a bitmap with one bit per heap page, and quite similar
to Visibility Map. A set bit means that all tuples on heap page are
completely frozen, therefore we don't need to do vacuum freeze that
page.
A bit is set when vacuum(or autovacuum) figures out that all tuples on
corresponding heap page are completely frozen, and a bit is cleared
when INSERT and UPDATE(only new heap page) are executed.
So, this patch avoids reading the all-frozen pages if it has not been
modified since the last VACUUM FREEZE? Since it is already frozen, the
running VACUUM FREEZE will not modify the page or generate WAL, so is it
really worth maintaining a new per-page bitmap just to avoid the
sequential scan of tables every 200MB transactions? I would like to see
us reduce the need for VACUUM FREEZE, rather than go in this direction.
How would you propose we do that?
I also think there's better ways we could handle *all* our cleanup
work. Tuples have a definite lifespan, and there's potentially a lot
of efficiency to be gained if we could track tuples through their
stages of life... but I don't see any easy ways to do that.
See the TODO list:
https://wiki.postgresql.org/wiki/Todo
o Avoid the requirement of freezing pages that are infrequently
modified
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 4/20/15 2:09 PM, Bruce Momjian wrote:
On Mon, Apr 20, 2015 at 01:59:17PM -0500, Jim Nasby wrote:
On 4/20/15 1:48 PM, Bruce Momjian wrote:
On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote:
Attached WIP patch adds Frozen Map which enables us to avoid whole
table vacuuming even when full scan is required: preventing XID
wraparound failures.
Frozen Map is a bitmap with one bit per heap page, and quite similar
to Visibility Map. A set bit means that all tuples on heap page are
completely frozen, therefore we don't need to do vacuum freeze that
page.
A bit is set when vacuum(or autovacuum) figures out that all tuples on
corresponding heap page are completely frozen, and a bit is cleared
when INSERT and UPDATE(only new heap page) are executed.
So, this patch avoids reading the all-frozen pages if it has not been
modified since the last VACUUM FREEZE? Since it is already frozen, the
running VACUUM FREEZE will not modify the page or generate WAL, so is it
really worth maintaining a new per-page bitmap just to avoid the
sequential scan of tables every 200MB transactions? I would like to see
us reduce the need for VACUUM FREEZE, rather than go in this direction.
How would you propose we do that?
I also think there's better ways we could handle *all* our cleanup
work. Tuples have a definite lifespan, and there's potentially a lot
of efficiency to be gained if we could track tuples through their
stages of life... but I don't see any easy ways to do that.
See the TODO list:
https://wiki.postgresql.org/wiki/Todo
o Avoid the requirement of freezing pages that are infrequently
modified
Right, but do you have a proposal for how that would actually happen?
Perhaps I'm mis-understanding you, but it sounded like you were opposed
to this patch because it doesn't do anything to avoid the need to
freeze. My point is that no one has any good ideas on how to avoid
freezing, and I think it's a safe bet that any ideas people do come up
with there will be a lot more invasive than a FrozenMap is.
While not perfect, a FrozenMap is something we can do today, without a
lot of effort, and it will provide definite value for any tables that
have a "good" amount of frozen pages. Without performance testing, we
don't know what "good" actually looks like, but we can't test without a
patch (which we now have). Assuming performance numbers look good I
think it would be folly to reject this patch in the hopes that
eventually we'll have some way to avoid the need to freeze.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Mon, Apr 20, 2015 at 03:58:19PM -0500, Jim Nasby wrote:
I also think there's better ways we could handle *all* our cleanup
work. Tuples have a definite lifespan, and there's potentially a lot
of efficiency to be gained if we could track tuples through their
stages of life... but I don't see any easy ways to do that.
See the TODO list:
https://wiki.postgresql.org/wiki/Todo
o Avoid the requirement of freezing pages that are infrequently
modified
Right, but do you have a proposal for how that would actually happen?
Perhaps I'm mis-understanding you, but it sounded like you were
opposed to this patch because it doesn't do anything to avoid the
need to freeze. My point is that no one has any good ideas on how to
avoid freezing, and I think it's a safe bet that any ideas people do
come up with there will be a lot more invasive than a FrozenMap is.
Didn't you think any of the TODO threads had workable solutions? And
don't expect adding an additional file per relation will be zero cost
--- added over the lifetime of 200M transactions, I question if this
approach would be a win.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 4/20/15 2:45 AM, Sawada Masahiko wrote:
Current patch adds new source file src/backend/access/heap/frozenmap.c
which is quite similar to visibilitymap.c. They have similar code but
are separated for now. I can refactor this code, e.g. by adding a common
bitmap.c, if needed.
My feeling is we'd definitely want this refactored; it looks to be a
whole lot of duplicated code. But before working on that we should get
consensus that a FrozenMap is a good idea.
Are there any meaningful differences between the two, besides the
obvious name changes?
I think there's also a bunch of XLOG stuff that could be refactored too...
Also, when skipping vacuum by visibility map, we can skip at least
SKIP_PAGE_THESHOLD consecutive pages, but there is no such mechanism in the
frozen map.
That's probably something else that can be factored out, since it's
basically the same logic. I suspect we just need to && some of the
checks so we're looking at both FM and VM at the same time.
Other comments...
It would be nice if we didn't need another page bit for FM; do you see
any reasonable way that could happen?
+ * If we didn't pin the visibility(and frozen) map page and the page has
+ * become all visible(and frozen) while we were busy locking the buffer,
+ * or during some subsequent window during which we had it unlocked,
+ * we'll have to unlock and re-lock, to avoid holding the buffer lock
+ * across an I/O. That's a bit unfortunate, especially since we'll now
+ * have to recheck whether the tuple has been locked or updated under us,
+ * but hopefully it won't happen very often.
*/
s/(and frozen)/ or frozen/
+ * Reply XLOG_HEAP3_FROZENMAP record.
s/Reply/Replay/
+ /*
+ * XLogReplayBufferExtended locked the buffer. But frozenmap_set
+ * will handle locking itself.
+ */
+ LockBuffer(fmbuffer, BUFFER_LOCK_UNLOCK);
Doesn't this create a race condition?
Are you sure the bit in finish_heap_swap() is safe? If so, we should add
the same for the visibility map too (it certainly better be all
visible if it's frozen...)
+ /*
+ * Current block is all-visible.
+ * If frozen map represents that it's all frozen and this
+ * function is called for freezing tuples, we can skip to
+ * vacuum block.
+ */
I would state this as "Even if scan_all is true, we can skip blocks that
are marked as frozen."
+ if (frozenmap_test(onerel, blkno, &fmbuffer) && scan_all)
I suspect it's faster to reverse those tests (scan_all &&
frozenmap_test())... but why do we even need to look at scan_all? AFAICT
if a block is frozen we can skip it unconditionally.
+ /*
+ * If the un-frozen tuple is remaining in current page and
+ * current page is marked as ALL_FROZEN, we should clear it.
+ */
That needs to NEVER happen. If it does then we're going to consider
tuples as visible/frozen that shouldn't be. We should probably throw an
error here, because it means the heap is now corrupted. At the minimum
it needs to be an assert().
Note that I haven't reviewed all the logic in detail at this point. If
this ends up being refactored it'll be a lot easier to spot logic
problems, so I'll hold off on that for now.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 4/20/15 4:13 PM, Bruce Momjian wrote:
On Mon, Apr 20, 2015 at 03:58:19PM -0500, Jim Nasby wrote:
I also think there's better ways we could handle *all* our cleanup
work. Tuples have a definite lifespan, and there's potentially a lot
of efficiency to be gained if we could track tuples through their
stages of life... but I don't see any easy ways to do that.
See the TODO list:
https://wiki.postgresql.org/wiki/Todo
o Avoid the requirement of freezing pages that are infrequently
modified
Right, but do you have a proposal for how that would actually happen?
Perhaps I'm mis-understanding you, but it sounded like you were
opposed to this patch because it doesn't do anything to avoid the
need to freeze. My point is that no one has any good ideas on how to
avoid freezing, and I think it's a safe bet that any ideas people do
come up with there will be a lot more invasive than a FrozenMap is.
Didn't you think any of the TODO threads had workable solutions? And
I didn't realize there were threads there.
The first three are discussion around the idea of eliminating the need
to freeze based on a page already being all visible. No patches.
/messages/by-id/CA+TgmoaEmnoLZmVbb8gvY69NA8zw9BWpiZ9+TLz-LnaBOZi7JA@mail.gmail.com
has a WIP patch that goes the route of using a tuple flag to indicate
frozen, but also raises a lot of concerns about visibility, because it
means we'd stop using FrozenXID. That impacts a large amount of code.
There were some followup patches as well as a bunch of discussion of how
to make it visible that a tuple was frozen or not. That thread died in
January 2014.
The fifth thread is XID to LSN mapping. AFAICT this has a significant
drawback in that it breaks page compatibility, meaning no pg_upgrade. It
ends 5/14/2014 with this comment:
"Well, Heikki was saying on another thread that he had kind of gotten
cold feet about this, so I gather he's not planning to pursue it. Not
sure if I understood that correctly. If so, I guess it depends on
whether someone else can pick it up, but we might first want to
establish why he got cold feet and how worrying those problems seem to
other people." -
/messages/by-id/CA+TgmoYoN8LzSuaffUaEkyV8Mhv1wi=ZLBXQ3VOfEZNO1dbw9Q@mail.gmail.com
So work was done on two alternative approaches, and then abandoned. Both
of those approaches might still be valid, but seem to need more work.
They're also higher risk because they're changing MVCC at a very
fundamental level.
As I mentioned, I think there's a lot better stuff we could be doing
about tuple lifetime, but there's no easy fixes to be had. This patch
solves a problem today, using a concept that's now well proven
(visibility map). If we had something more sophisticated being developed
then I'd be inclined not to pursue this patch, but that's not the case.
Perhaps others can elaborate on where those two patches are at...
don't expect adding an additional file per relation will be zero cost --- added over the lifetime of 200M transactions, I question if this approach would be a win.
Can you elaborate on this? I don't see how the number of transactions
would come into play, but the overhead here is not large; the FrozenMap
would be the same size as the VM map, which is 1/64,000th as large as
the heap. So a 64G table means a 1M FM. That doesn't seem very expensive.
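For what it's worth, here is a minimal back-of-the-envelope sketch in plain C
(not PostgreSQL code; it just assumes the FM keeps one bit per 8 KB heap page,
exactly like the VM) that reproduces that estimate:

#include <stdio.h>

int
main(void)
{
    double heap_bytes = 64.0 * 1024 * 1024 * 1024;  /* a 64 GB table */
    double blcksz = 8192.0;                         /* default BLCKSZ */

    double heap_pages = heap_bytes / blcksz;        /* ~8.4 million pages */
    double fm_bytes = heap_pages / 8.0;             /* one bit per heap page */

    printf("frozen map: %.1f MB, heap-to-map ratio 1:%.0f\n",
           fm_bytes / (1024 * 1024), blcksz * 8.0);
    return 0;
}

That prints roughly 1.0 MB for a 64 GB heap, i.e. a 1:65536 ratio.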
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 04/20/2015 02:13 PM, Bruce Momjian wrote:
Didn't you think any of the TODO threads had workable solutions? And don't expect adding an additional file per relation will be zero cost --- added over the lifetime of 200M transactions, I question if this approach would be a win.
Well, the only real way to test that is a prototype, no?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 2015-04-20 17:13:29 -0400, Bruce Momjian wrote:
Didn't you think any of the TODO threads had workable solutions? And don't expect adding an additional file per relation will be zero cost --- added over the lifetime of 200M transactions, I question if this approach would be a win.
Note that normally you'd not run with a 200M transaction freeze max age
on a busy server. Rather, around an order of magnitude more.
Think about this being used on a time-partitioned table. Right now all
the partitions have to be fully rescanned on a regular basis - quite
painful. With something like this normally only the newest partitions
will have to be.
Greetings,
Andres Freund
On Tue, Apr 21, 2015 at 7:00 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 4/20/15 2:45 AM, Sawada Masahiko wrote:
Current patch adds new source file src/backend/access/heap/frozenmap.c
which is quite similar to visibilitymap.c. They have similar code but
are separated for now. I can refactor this code, e.g. by adding a common
bitmap.c, if needed.
Thank you for having a look at this patch.
My feeling is we'd definitely want this refactored; it looks to be a whole
lot of duplicated code. But before working on that we should get consensus
that a FrozenMap is a good idea.
Yes, we need to get consensus about the FrozenMap before starting work on it.
In addition to the comments you pointed out, I noticed one problem I
should address: a FrozenMap bit also needs to be cleared on deletion
(i.e., when xmax is set).
For now a page marked frozen could still contain dead tuples, but I am
thinking of changing that so a frozen page guarantees that the page is all
frozen *and* all visible.
Are there any meaningful differences between the two, besides the obvious
name changes?
No, there aren't.
I think there's also a bunch of XLOG stuff that could be refactored too...
I agree with you.
Also, when skipping vacuum by visibility map, we can skip at least
SKIP_PAGE_THESHOLD consecutive pages, but there is no such mechanism in the
frozen map.
That's probably something else that can be factored out, since it's
basically the same logic. I suspect we just need to && some of the checks
so we're looking at both FM and VM at the same time.
The FrozenMap is used to skip scanning only during anti-wraparound vacuum or
when freezing all tuples (i.e., scan_all is true).
A normal vacuum uses only the VM and does not use the FM for now.
Other comments...
It would be nice if we didn't need another page bit for FM; do you see any
reasonable way that could happen?
We may be able to remove the page bit for the FM from the page header, but
I'm not sure we can do that.
+ * If we didn't pin the visibility(and frozen) map page and the page has
+ * become all visible(and frozen) while we were busy locking the buffer,
+ * or during some subsequent window during which we had it unlocked,
+ * we'll have to unlock and re-lock, to avoid holding the buffer lock
+ * across an I/O. That's a bit unfortunate, especially since we'll now
+ * have to recheck whether the tuple has been locked or updated under us,
+ * but hopefully it won't happen very often.
*/
s/(and frozen)/ or frozen/
+ * Reply XLOG_HEAP3_FROZENMAP record.
s/Reply/Replay/
Understood.
+ /*
+ * XLogReplayBufferExtended locked the buffer. But frozenmap_set
+ * will handle locking itself.
+ */
+ LockBuffer(fmbuffer, BUFFER_LOCK_UNLOCK);
Doesn't this create a race condition?
Are you sure the bit in finish_heap_swap() is safe? If so, we should add
the same for the visibility map too (it certainly better be all visible
if it's frozen...)
We cannot ensure the page is all-visible even after VACUUM FULL, because
dead tuples may remain, e.g. when another process inserts and updates the
same tuple in the same transaction before VACUUM FULL.
I was thinking that the FrozenMap is unaffected by delete operations, but as
I said at the top of this mail, a FrozenMap bit needs to be cleared on
deletion.
So I will remove the related code as you mentioned.
+ /*
+ * Current block is all-visible.
+ * If frozen map represents that it's all frozen and this
+ * function is called for freezing tuples, we can skip to
+ * vacuum block.
+ */
I would state this as "Even if scan_all is true, we can skip blocks that are
marked as frozen."
+ if (frozenmap_test(onerel, blkno, &fmbuffer) && scan_all)
I suspect it's faster to reverse those tests (scan_all &&
frozenmap_test())... but why do we even need to look at scan_all? AFAICT
if a block is frozen we can skip it unconditionally.
In the current patch, a tuple that is frozen but dead can remain on a page
that is marked all-frozen.
That is, it is possible for a page to be marked frozen without being all
visible.
But I'm thinking of changing that.
+ /*
+ * If the un-frozen tuple is remaining in current page and
+ * current page is marked as ALL_FROZEN, we should clear it.
+ */
That needs to NEVER happen. If it does then we're going to consider tuples
as visible/frozen that shouldn't be. We should probably throw an error here,
because it means the heap is now corrupted. At the minimum it needs to be an
assert().
I understood. I'll fix it.
Note that I haven't reviewed all the logic in detail at this point. If this
ends up being refactored it'll be a lot easier to spot logic problems, so
I'll hold off on that for now.
Understood; we need to get consensus first.
Regards,
-------
Sawada Masahiko
On 2015-04-21 23:59:45 +0900, Sawada Masahiko wrote:
For now a page marked frozen could still contain dead tuples, but I am
thinking of changing that so a frozen page guarantees that the page is all
frozen *and* all visible.
It shouldn't. That'd potentially cause corruption after a wraparound. A
tuple's visibility might change due to that.
Greetings,
Andres Freund
On Wed, Apr 22, 2015 at 12:02 AM, Andres Freund <andres@anarazel.de> wrote:
On 2015-04-21 23:59:45 +0900, Sawada Masahiko wrote:
For now a page marked frozen could still contain dead tuples, but I am
thinking of changing that so a frozen page guarantees that the page is all
frozen *and* all visible.
It shouldn't. That'd potentially cause corruption after a wraparound. A
tuple's visibility might change due to that.
A page marked frozen could have some dead tuples, right?
I think we should clear the FrozenMap bit (and the page header flag) on
delete operations, and the bit should be set only by vacuum.
So, accordingly, a page marked frozen would guarantee that it is all frozen
and all visible?
Regards,
-------
Sawada Masahiko
On 2015-04-22 00:15:53 +0900, Sawada Masahiko wrote:
On Wed, Apr 22, 2015 at 12:02 AM, Andres Freund <andres@anarazel.de> wrote:
On 2015-04-21 23:59:45 +0900, Sawada Masahiko wrote:
For now a page marked frozen could still contain dead tuples, but I am
thinking of changing that so a frozen page guarantees that the page is all
frozen *and* all visible.
It shouldn't. That'd potentially cause corruption after a wraparound. A
tuple's visibility might change due to that.
A page marked frozen could have some dead tuples, right?
Well, we right now don't really freeze pages, but tuples. But in what
you described above that could happen.
I think we should clear the FrozenMap bit (and the page header flag) on
delete operations, and the bit should be set only by vacuum.
Yes.
So, accordingly, a page marked frozen would guarantee that it is all frozen
and all visible?
I think that's how it has to be, yes.
I *do* wonder if we shouldn't redefine the VM to also contain
information about the frozenness. Having two identically structured maps
that'll often both have to be touched at the same time isn't
nice. Neither is adding another fork. Given the size of the files
pg_upgrade could be made to rewrite them. The bigger question is
probably how bad that'd be for index-only efficiency.
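Just to make that concrete, a purely illustrative sketch of the bookkeeping
for such a combined map could look like this (the names and constants are
assumptions for discussion, not an existing on-disk format):

/*
 * Illustrative only: two bits per heap page in one map fork, one bit
 * for all-visible and one for all-frozen.
 */
#include <stdio.h>

#define MAPBLOCK_SIZE        8192   /* one 8 KB map page */
#define BITS_PER_HEAPBLOCK   2
#define MAP_ALL_VISIBLE      0x01
#define MAP_ALL_FROZEN       0x02

int
main(void)
{
    /* heap pages covered by a single map page */
    long heapblocks_per_mapblock = (MAPBLOCK_SIZE * 8) / BITS_PER_HEAPBLOCK;

    printf("one map page covers %ld heap pages (~%ld MB of heap)\n",
           heapblocks_per_mapblock,
           heapblocks_per_mapblock * 8192L / (1024 * 1024));
    return 0;
}

So the combined map would be twice the size of today's VM, still a ~32768:1
ratio against the heap.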
Greetings,
Andres Freund
On Mon, Apr 20, 2015 at 7:59 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
/messages/by-id/CA+TgmoaEmnoLZmVbb8gvY69NA8zw9BWpiZ9+TLz-LnaBOZi7JA@mail.gmail.com
has a WIP patch that goes the route of using a tuple flag to indicate
frozen, but also raises a lot of concerns about visibility, because it means
we'd stop using FrozenXID. That impacts a large amount of code. There were
some followup patches as well as a bunch of discussion of how to make it
visible that a tuple was frozen or not. That thread died in January 2014.
Actually, this change has already been made, so it's not so much of a
to-do as a was-done. See commit
37484ad2aacef5ec794f4dd3d5cf814475180a78. The immediate thing we got
out of that change is that when CLUSTER or VACUUM FULL rewrite a
table, they now freeze all of the tuples using this method. See
commits 3cff1879f8d03cb729368722ca823a4bf74c0cac and
af2543e884db06c0beb75010218cd88680203b86. Previously, CLUSTER or
VACUUM FULL would not freeze anything, which meant that people who
tried to use VACUUM FULL to recover from XID wraparound problems got
nowhere, and even people who knew when to use which tool could end up
having to VACUUM FULL and then VACUUM FREEZE afterward, rewriting the
table twice, an annoyance.
It's possible that we could use this infrastructure to freeze more
aggressively in other circumstances. For example, perhaps VACUUM
should freeze any page it intends to mark all-visible. That's not a
guaranteed win, because it might increase WAL volume: setting a page
all-visible does not emit an FPI for that page, but freezing any tuple
on it would, if the page hasn't otherwise been modified since the last
checkpoint. Even if that were no issue, the freezing itself must be
WAL-logged. But if we could somehow get to a place where all-visible
=> frozen, then autovacuum would never need to visit all-visible
pages, a huge win.
We could also attack the problem from the other end. Instead of
trying to set the bits on the individual tuples, we could decide that
whenever a page is marked all-visible, we regard it as frozen
regardless of the bits set or not set on the individual tuples.
Anybody who wants to modify the page must freeze any unfrozen tuples
"for real" before clearing the visibility map bit. This would have
the same end result as the previous idea: all-visible would
essentially imply frozen, and autovacuum could ignore those pages
categorically.
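To make the ordering concrete, here is a toy model in plain C (not heapam
code; freezing is reduced to a per-tuple boolean, and the names are made up)
of the invariant that the all-visible flag may only be cleared after any
unfrozen tuples have been frozen for real:

#include <stdbool.h>
#include <stdio.h>

#define TUPLES_PER_PAGE 4

typedef struct
{
    bool frozen[TUPLES_PER_PAGE];
    bool all_visible;
} ToyPage;

/* Freeze every tuple "for real", only then drop the all-visible status. */
static void
clear_all_visible(ToyPage *page)
{
    for (int i = 0; i < TUPLES_PER_PAGE; i++)
        page->frozen[i] = true;
    page->all_visible = false;
}

int
main(void)
{
    ToyPage page = {{true, false, true, true}, true};

    clear_all_visible(&page);
    printf("all_visible=%d, tuple1 frozen=%d\n",
           page.all_visible, page.frozen[1]);
    return 0;
}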
I'm not saying those ideas don't have problems, because they do. But
I think they are worth further exploring. The main reason I gave up
on that is because Heikki was working on the XID-to-LSN mapping stuff.
That seemed like a better approach than either of the above, so as
long as Heikki was working on that, there wasn't much reason to pursue
more lowbrow approaches. Clearly, though, we need to do something
about this. Freezing is a big problem for lots of users.
All that having been said, I don't think adding a new fork is a good
approach. We already have problems pretty commonly where our
customers complain about running out of inodes. Adding another fork
for every table would exacerbate that problem considerably.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-04-21 16:21:47 -0400, Robert Haas wrote:
All that having been said, I don't think adding a new fork is a good
approach. We already have problems pretty commonly where our
customers complain about running out of inodes. Adding another fork
for every table would exacerbate that problem considerably.
Really? These days? There's good arguments against another fork
(increased number of fsyncs, more stat calls, increased number of file
handles, more WAL logging, ...), but the number of inodes themselves
seems like something halfway recent filesystems should handle.
Greetings,
Andres Freund
On Tue, Apr 21, 2015 at 4:27 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-04-21 16:21:47 -0400, Robert Haas wrote:
All that having been said, I don't think adding a new fork is a good
approach. We already have problems pretty commonly where our
customers complain about running out of inodes. Adding another fork
for every table would exacerbate that problem considerably.Really? These days? There's good arguments against another fork
(increased number of fsyncs, more stat calls, increased number of file
handles, more WAL logging, ...), but the number of inodes themselves
seems like something halfway recent filesystems should handle.
Not making it up...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 4/21/15 3:21 PM, Robert Haas wrote:
It's possible that we could use this infrastructure to freeze more
aggressively in other circumstances. For example, perhaps VACUUM
should freeze any page it intends to mark all-visible. That's not a
guaranteed win, because it might increase WAL volume: setting a page
all-visible does not emit an FPI for that page, but freezing any tuple
on it would, if the page hasn't otherwise been modified since the last
checkpoint. Even if that were no issue, the freezing itself must be
WAL-logged. But if we could somehow get to a place where all-visible
=> frozen, then autovacuum would never need to visit all-visible
pages, a huge win.
I don't know how bad the extra WAL traffic would be; we'd obviously need
to incur it eventually, so it's a question of how common it is for a
page to go all-visible but then go not-all-visible again before
freezing. It would presumably be far more traffic than some form of a
FrozenMap though...
We could also attack the problem from the other end. Instead of
trying to set the bits on the individual tuples, we could decide that
whenever a page is marked all-visible, we regard it as frozen
regardless of the bits set or not set on the individual tuples.
Anybody who wants to modify the page must freeze any unfrozen tuples
"for real" before clearing the visibility map bit. This would have
the same end result as the previous idea: all-visible would
essentially imply frozen, and autovacuum could ignore those pages
categorically.
Pushing what's currently background work onto foreground processes
doesn't seem like a good idea...
I'm not saying those ideas don't have problems, because they do. But
I think they are worth further exploring. The main reason I gave up
on that is because Heikki was working on the XID-to-LSN mapping stuff.
That seemed like a better approach than either of the above, so as
long as Heikki was working on that, there wasn't much reason to pursue
more lowbrow approaches. Clearly, though, we need to do something
about this. Freezing is a big problem for lots of users.
Did XID-LSN die? I see at the bottom of the thread it was returned with
feedback; I guess Heikki just hasn't had time and there's no major
blockers? From what I remember this is probably a better solution, but
if it's not going to make it into 9.6 then we should probably at least
look further into a FM.
All that having been said, I don't think adding a new fork is a good
approach. We already have problems pretty commonly where our
customers complain about running out of inodes. Adding another fork
for every table would exacerbate that problem considerably.
Andres' idea of adding this to the VM may work well to handle that. It
would double the size of the VM, but it would still be a ratio of
32,000-1 compared to heap size, or 2MB for a 64GB table.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Apr 21, 2015 at 7:24 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 4/21/15 3:21 PM, Robert Haas wrote:
It's possible that we could use this infrastructure to freeze more
aggressively in other circumstances. For example, perhaps VACUUM
should freeze any page it intends to mark all-visible. That's not a
guaranteed win, because it might increase WAL volume: setting a page
all-visible does not emit an FPI for that page, but freezing any tuple
on it would, if the page hasn't otherwise been modified since the last
checkpoint. Even if that were no issue, the freezing itself must be
WAL-logged. But if we could somehow get to a place where all-visible
=> frozen, then autovacuum would never need to visit all-visible
pages, a huge win.
I don't know how bad the extra WAL traffic would be; we'd obviously need to
incur it eventually, so it's a question of how common it is for a page to go
all-visible but then go not-all-visible again before freezing. It would
presumably be far more traffic than some form of a FrozenMap though...
Yeah, maybe. The freeze record contains details for each TID, while
the freeze map bit would only need to be set once for the whole page.
I wonder if the format of that record could be optimized somehow.
We could also attack the problem from the other end. Instead of
trying to set the bits on the individual tuples, we could decide that
whenever a page is marked all-visible, we regard it as frozen
regardless of the bits set or not set on the individual tuples.
Anybody who wants to modify the page must freeze any unfrozen tuples
"for real" before clearing the visibility map bit. This would have
the same end result as the previous idea: all-visible would
essentially imply frozen, and autovacuum could ignore those pages
categorically.
Pushing what's currently background work onto foreground processes doesn't
seem like a good idea...
When you phrase it that way, no, but pushing work that otherwise would
need to be done right now off to a future time that may never arrive
sounds like a good idea. Today, we freeze the page -- rewriting it --
and then keep scanning those all-frozen pages every X number of
transactions to make sure they are really all-frozen. In this system,
we'd eliminate the repeated scanning and defer the freeze work until
the page actually gets modified again. But that might never happen,
in which case we never have to do the work at all.
I'm not saying those ideas don't have problems, because they do. But
I think they are worth further exploring. The main reason I gave up
on that is because Heikki was working on the XID-to-LSN mapping stuff.
That seemed like a better approach than either of the above, so as
long as Heikki was working on that, there wasn't much reason to pursue
more lowbrow approaches. Clearly, though, we need to do something
about this. Freezing is a big problem for lots of users.
Did XID-LSN die? I see at the bottom of the thread it was returned with
feedback; I guess Heikki just hasn't had time and there's no major blockers?
From what I remember this is probably a better solution, but if it's not
going to make it into 9.6 then we should probably at least look further into
a FM.
Heikki said he'd lost enthusiasm for it, but he wasn't too specific
about his reasons, IIRC. I guess maybe just that it got complicated,
and he wasn't sure it was correct.
All that having been said, I don't think adding a new fork is a good
approach. We already have problems pretty commonly where our
customers complain about running out of inodes. Adding another fork
for every table would exacerbate that problem considerably.
Andres' idea of adding this to the VM may work well to handle that. It would
double the size of the VM, but it would still be a ratio of 32,000-1
compared to heap size, or 2MB for a 64GB table.
Yes, that's got some potential. It would mean pg_upgrade would have
to remove all existing visibility maps when upgrading to the new
version, or rewrite them into the new format. But it otherwise seems
promising.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote:
It's possible that we could use this infrastructure to freeze
more aggressively in other circumstances. For example, perhaps
VACUUM should freeze any page it intends to mark all-visible.
That's not a guaranteed win, because it might increase WAL
volume: setting a page all-visible does not emit an FPI for that
page, but freezing any tuple on it would, if the page hasn't
otherwise been modified since the last checkpoint. Even if that
were no issue, the freezing itself must be WAL-logged. But if we
could somehow get to a place where all-visible => frozen, then
autovacuum would never need to visit all-visible pages, a huge
win.
That would eliminate full-table scan vacuums, right? It would do
that by adding incremental effort and WAL to the "normal"
autovacuum run to eliminate the full table scan and the associated
mass freeze WAL-logging? It's hard to see how that would not be an
overall win.
We could also attack the problem from the other end. Instead of
trying to set the bits on the individual tuples, we could decide
that whenever a page is marked all-visible, we regard it as
frozen regardless of the bits set or not set on the individual
tuples. Anybody who wants to modify the page must freeze any
unfrozen tuples "for real" before clearing the visibility map
bit. This would have the same end result as the previous idea:
all-visible would essentially imply frozen, and autovacuum could
ignore those pages categorically.
Besides putting work into the foreground that could be done in the
background, that sounds more complicated. Also, there is no
ability to "pace" the freeze load or use scheduled jobs to shift
the work to off-peak hours.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Apr 22, 2015 at 11:09 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
Robert Haas <robertmhaas@gmail.com> wrote:
It's possible that we could use this infrastructure to freeze
more aggressively in other circumstances. For example, perhaps
VACUUM should freeze any page it intends to mark all-visible.
That's not a guaranteed win, because it might increase WAL
volume: setting a page all-visible does not emit an FPI for that
page, but freezing any tuple on it would, if the page hasn't
otherwise been modified since the last checkpoint. Even if that
were no issue, the freezing itself must be WAL-logged. But if we
could somehow get to a place where all-visible => frozen, then
autovacuum would never need to visit all-visible pages, a huge
win.
That would eliminate full-table scan vacuums, right? It would do
that by adding incremental effort and WAL to the "normal"
autovacuum run to eliminate the full table scan and the associated
mass freeze WAL-logging? It's hard to see how that would not be an
overall win.
Yes and yes.
In terms of an overall win, this design loses when the tuples that
have been recently marked all-visible are going to get updated again
in the near future. In that case, the effort we spend to freeze them
is wasted. I just tested "pgbench -i -s 40 -n" followed by "VACUUM"
or alternatively followed by "VACUUM FREEZE". The VACUUM generated
4641kB of WAL. The VACUUM FREEZE generated 515MB of WAL - that is,
113 times more. So changing every VACUUM to act like VACUUM FREEZE
would be quite expensive. We'll still come out ahead if those tuples
are going to stick around long enough that they would have eventually
gotten frozen anyway, but if they get deleted again the loss is pretty
significant.
Incidentally, the reason for the large difference is that when Heikki
created the visibility map, it wasn't necessary for the WAL records
that set the visibility map bits to bump the page LSN, because it was
just a hint anyway. When I made the visibility-map crash-safe, I went
to some pains to preserve that property. Therefore, a regular VACUUM
does not emit full page images for the heap pages - it does for the
visibility map pages themselves, but there aren't very many of those.
In this example, the relation itself was 512MB, so you can see that
adding freezing to the mix roughly doubles the I/O cost. Either way
we have to write half a gig of dirty data pages, but in one case we
also have to write an additional half a gig of WAL.
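A rough back-of-the-envelope check of those numbers, assuming a ~512MB relation and 8kB pages (assumed round figures, not measurements from the test itself):

#include <stdio.h>

int main(void)
{
	/* Assumed figures: a ~512MB pgbench_accounts relation and 8kB pages. */
	double relation_bytes = 512.0 * 1024 * 1024;
	double page_bytes = 8192.0;
	double pages = relation_bytes / page_bytes;		/* ~65536 pages */

	/* Freezing dirties every page after a checkpoint: roughly one FPI each. */
	double fpi_wal_bytes = pages * page_bytes;		/* ~512MB of extra WAL */

	printf("pages: %.0f, estimated freeze WAL from FPIs: %.0f MB\n",
		   pages, fpi_wal_bytes / (1024.0 * 1024.0));
	return 0;
}

That lines up with the observed 515MB of WAL from VACUUM FREEZE.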
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 04/22/2015 05:33 PM, Robert Haas wrote:
On Tue, Apr 21, 2015 at 7:24 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 4/21/15 3:21 PM, Robert Haas wrote:
I'm not saying those ideas don't have problems, because they do. But
I think they are worth further exploring. The main reason I gave up
on that is because Heikki was working on the XID-to-LSN mapping stuff.
That seemed like a better approach than either of the above, so as
long as Heikki was working on that, there wasn't much reason to pursue
more lowbrow approaches. Clearly, though, we need to do something
about this. Freezing is a big problem for lots of users.
Did XID-LSN die? I see at the bottom of the thread it was returned with
feedback; I guess Heikki just hasn't had time and there's no major blockers?
From what I remember this is probably a better solution, but if it's not
going to make it into 9.6 then we should probably at least look further into
a FM.
Heikki said he'd lost enthusiasm for it, but he wasn't too specific
about his reasons, IIRC. I guess maybe just that it got complicated,
and he wasn't sure it was correct.
I'd like to continue working on that when I get around to it. Or even
better if someone else continues it :-).
The thing that made me nervous about that approach is that it made the
LSN of each page critical information. If you somehow zeroed out the
LSN, you could no longer tell which pages are frozen and which are not.
I'm sure it could be made to work - and I got it working to some degree
anyway - but it's a bit scary. It's similar to the multixid changes in
9.3: multixids also used to be data that you can just zap at restart,
and when we changed the rules so that you lose data if you lose
multixids, we got trouble. Now, LSNs are much simpler, and there
wouldn't be anything like the multioffset/member SLRUs that you'd have
to keep around forever or vacuum, but still..
I would feel safer if we added a completely new "epoch" counter to the
page header, instead of reusing LSNs. But as we all know, changing the
page format is a problem for in-place upgrade, and takes some space too.
- Heikki
Robert Haas <robertmhaas@gmail.com> wrote:
I just tested "pgbench -i -s 40 -n" followed by "VACUUM" or
alternatively followed by "VACUUM FREEZE". The VACUUM generated
4641kB of WAL. The VACUUM FREEZE generated 515MB of WAL - that
is, 113 times more.
Essentially a bulk load. OK, so if you bulk load data and then
vacuum it before updating 100% of it, this approach will generate a
lot more WAL than we currently do. Of course, if you don't VACUUM
FREEZE after a bulk load and then are engaged in a fairly normal
OLTP workload with peak and off-peak cycles, you are currently
almost certain to hit a point during peak OLTP load where you begin
to sequentially scan all tables, rewriting them in place, with WAL
logging. Incidentally, this tends to flush a lot of your "hot"
data out of cache, increasing disk reads. The first time I hit
this "interesting" experience in production it was so devastating,
and generated so many user complaints, that I never again
considered a bulk load complete until I had run VACUUM FREEZE on it
-- although I was sometimes able to defer that to an off-peak
window of time.
In other words, for the production environments I managed, the only
value of that number is in demonstrating the importance of using
unlogged COPY followed by VACUUM FREEZE for bulk-loading and
capturing a fresh base backup upon completion. A better way to use
pgbench to measure WAL size cost might be to initialize, VACUUM
FREEZE to set a "long term baseline", and do a reasonable length
run with crontab running VACUUM FREEZE periodically (including
after the run was complete) versus doing the same with plain VACUUM
(followed by a VACUUM FREEZE at the end?). Comparing the total WAL
sizes generated following the initial load and VACUUM FREEZE would
give a more accurate picture of the impact on an OLTP load, I
think.
We'll still come out ahead if those tuples are going to stick
around long enough that they would have eventually gotten frozen
anyway, but if they get deleted again the loss is pretty
significant.
Perhaps my perception is biased by having worked in an environment
where the vast majority of tuples (both in terms of tuple count and
byte count) were never updated and were only eligible for deletion
after a period of years. Our current approach is pretty bad in
such an environment, at least if you try to leave all vacuuming to
autovacuum. I'll admit that we were able to work around the
problems by running VACUUM FREEZE every night for most databases.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Apr 22, 2015 at 12:39 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
The thing that made me nervous about that approach is that it made the LSN
of each page critical information. If you somehow zeroed out the LSN, you
could no longer tell which pages are frozen and which are not. I'm sure it
could be made to work - and I got it working to some degree anyway - but
it's a bit scary. It's similar to the multixid changes in 9.3: multixids
also used to be data that you can just zap at restart, and when we changed
the rules so that you lose data if you lose multixids, we got trouble. Now,
LSNs are much simpler, and there wouldn't be anything like the
multioffset/member SLRUs that you'd have to keep around forever or vacuum,
but still..
LSNs are already pretty critical. If they're in the future, you can't
flush those pages. Ever. And if they're wrong in either direction,
crash recovery is broken. But it's still worth thinking about ways
that we could make this more robust.
I keep coming back to the idea of treating any page that is marked as
all-visible as frozen, and deferring freezing until the page is again
modified. The big downside of this is that if the page is set as
all-visible and then immediately thereafter modified, it sucks to have
to freeze when the XIDs in the page are still present in CLOG. But if
we could determine from the LSN that the XIDs in the page are new
enough to still be considered valid, then we could skip freezing in
those cases and only do it when the page is "old". That way, if
somebody zeroed out the LSN (why, oh why?) the worst that would happen
is that we'd do some extra freezing when the page was next modified.
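A minimal sketch of that decision, with an invented LSN stand-in and invented names (nothing here is the real PostgreSQL API):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyLSN;	/* invented stand-in for a page LSN */

/*
 * Toy decision: when clearing the all-visible bit, only freeze "for real"
 * if the page's LSN says its XIDs may predate the freeze horizon.  A page
 * whose LSN was somehow zeroed simply looks old, so the worst case is a
 * little unnecessary freezing.
 */
static bool toy_must_freeze_before_modify(ToyLSN page_lsn, ToyLSN freeze_horizon_lsn)
{
	return page_lsn < freeze_horizon_lsn;
}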
I would feel safer if we added a completely new "epoch" counter to the page
header, instead of reusing LSNs. But as we all know, changing the page
format is a problem for in-place upgrade, and takes some space too.
Yeah. We have a serious need to reduce the size of our on-disk
format. On a TPC-C-like workload Jan Wieck recently tested, our data
set was 34% larger than another database at the beginning of the test,
and 80% larger by the end of the test. And we did twice the disk
writes. See "The Elephants in the Room.pdf" at
https://sites.google.com/site/robertmhaas/presentations
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Apr 22, 2015 at 2:23 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
Robert Haas <robertmhaas@gmail.com> wrote:
I just tested "pgbench -i -s 40 -n" followed by "VACUUM" or
alternatively followed by "VACUUM FREEZE". The VACUUM generated
4641kB of WAL. The VACUUM FREEZE generated 515MB of WAL - that
is, 113 times more.
Essentially a bulk load. OK, so if you bulk load data and then
vacuum it before updating 100% of it, this approach will generate a
lot more WAL than we currently do. Of course, if you don't VACUUM
FREEZE after a bulk load and then are engaged in a fairly normal
OLTP workload with peak and off-peak cycles, you are currently
almost certain to hit a point during peak OLTP load where you begin
to sequentially scan all tables, rewriting them in place, with WAL
logging. Incidentally, this tends to flush a lot of your "hot"
data out of cache, increasing disk reads. The first time I hit
this "interesting" experience in production it was so devastating,
and generated so many user complaints, that I never again
considered a bulk load complete until I had run VACUUM FREEZE on it
-- although I was sometimes able to defer that to an off-peak
window of time.
In other words, for the production environments I managed, the only
value of that number is in demonstrating the importance of using
unlogged COPY followed by VACUUM FREEZE for bulk-loading and
capturing a fresh base backup upon completion. A better way to use
pgbench to measure WAL size cost might be to initialize, VACUUM
FREEZE to set a "long term baseline", and do a reasonable length
run with crontab running VACUUM FREEZE periodically (including
after the run was complete) versus doing the same with plain VACUUM
(followed by a VACUUM FREEZE at the end?). Comparing the total WAL
sizes generated following the initial load and VACUUM FREEZE would
give a more accurate picture of the impact on an OLTP load, I
think.
Sure, that would be a better test. But I'm pretty sure the impact
will still be fairly substantial.
We'll still come out ahead if those tuples are going to stick
around long enough that they would have eventually gotten frozen
anyway, but if they get deleted again the loss is pretty
significant.
Perhaps my perception is biased by having worked in an environment
where the vast majority of tuples (both in terms of tuple count and
byte count) were never updated and were only eligible for deletion
after a period of years. Our current approach is pretty bad in
such an environment, at least if you try to leave all vacuuming to
autovacuum. I'll admit that we were able to work around the
problems by running VACUUM FREEZE every night for most databases.
Yeah. And that breaks down when you have very big databases with a
high XID consumption rate, because the mostly-no-op VACUUM FREEZE runs
for longer than you can tolerate. I'm not saying we don't need to fix
this problem; we clearly do. I'm just saying that we've got to be
careful not to harm other scenarios in the process.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 4/22/15 1:24 PM, Robert Haas wrote:
I keep coming back to the idea of treating any page that is marked as
all-visible as frozen, and deferring freezing until the page is again
modified. The big downside of this is that if the page is set as
all-visible and then immediately thereafter modified, it sucks to have
to freeze when the XIDs in the page are still present in CLOG. But if
we could determine from the LSN that the XIDs in the page are new
enough to still be considered valid, then we could skip freezing in
those cases and only do it when the page is "old". That way, if
somebody zeroed out the LSN (why, oh why?) the worst that would happen
is that we'd do some extra freezing when the page was next modified.
Maybe freezing a page as part of making it not all-visible wouldn't be
that horrible, even without LSN.
For one, we already know that every tuple is visible, so no MVCC checks
needed. That's probably a significant savings over current freezing.
If we're marking a page as no longer all-visible, that means we're
already dirtying it and generating WAL for it (likely including a FPI).
We may be able to consolidate all of this into a new WAL record that's a
lot more efficient than what we currently do for freezing. I suspect we
wouldn't need to log each TID we're freezing, for starters. Even if we
did though, we could at least combine all that into one WAL message that
just contains an array of TIDs or LPs.
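Purely as an illustration of that direction, here is what such a consolidated record could look like; this is an invented layout, not anything in today's WAL format, where the freeze record instead carries per-tuple detail:

#include <stdint.h>

/* Hypothetical combined "clear all-visible + freeze whole page" record. */
typedef struct ToyXlClearAndFreeze
{
	uint32_t	block;		/* heap block whose VM bit is being cleared */
	uint16_t	noffsets;	/* number of line pointers frozen below */
	/* followed by noffsets uint16_t line pointer numbers */
} ToyXlClearAndFreeze;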
<ponders...> I think we could actually proactively freeze tuples during
vacuum too, at least if we're about to mark the page as all-visible.
Though, with Robert's HEAP_XMIN_FROZEN change we could be a lot more
aggressive about freezing during VACUUM, certainly for pages we're
already dirtying, especially if we can keep the WAL cost of that down.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Apr 21, 2015 at 08:39:37AM +0200, Andres Freund wrote:
On 2015-04-20 17:13:29 -0400, Bruce Momjian wrote:
Didn't you think any of the TODO threads had workable solutions? And don't expect adding an additional file per relation will be zero cost --- added over the lifetime of 200M transactions, I question if this approach would be a win.
Note that normally you'd not run with a 200M transaction freeze max age on a busy server. Rather around a magnitude more.
Think about this being used on a time partitioned table. Right now all
the partitions have to be fully rescanned on a regular basis - quite
painful. With something like this normally only the newest partitions
will have to be.
My point is that for the life of 200M transactions, you would have the
overhead of an additional file per table in the file system, and updates
of that. I just don't know if the overhead over the long time period
would be smaller than the VACUUM FREEZE. It might be fine --- I don't
know. People seem to focus on the big activities, while many small
activities can lead to larger slowdowns.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 4/22/15 6:12 PM, Bruce Momjian wrote:
My point is that for the life of 200M transactions, you would have the
overhead of an additional file per table in the file system, and updates
of that. I just don't know if the overhead over the long time period
would be smaller than the VACUUM FREEZE. It might be fine --- I don't
know. People seem to focus on the big activities, while many small
activities can lead to larger slowdowns.
Ahh. This wouldn't be for the life of 200M transactions; it would be a
permanent fork, just like the VM is.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Wed, Apr 22, 2015 at 06:36:23PM -0500, Jim Nasby wrote:
On 4/22/15 6:12 PM, Bruce Momjian wrote:
My point is that for the life of 200M transactions, you would have the
overhead of an additional file per table in the file system, and updates
of that. I just don't know if the overhead over the long time period
would be smaller than the VACUUM FREEZE. It might be fine --- I don't
know. People seem to focus on the big activities, while many small
activities can lead to larger slowdowns.
Ahh. This wouldn't be for the life of 200M transactions; it would be
a permanent fork, just like the VM is.
Right. My point is that either you do X 2M times to maintain that fork
and the overhead of the file existence, or you do one VACUUM FREEZE. I
am saying that 2M is a large number and adding all those X's might
exceed the cost of a VACUUM FREEZE.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Thu, Apr 23, 2015 at 3:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Apr 22, 2015 at 12:39 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
The thing that made me nervous about that approach is that it made the LSN
of each page critical information. If you somehow zeroed out the LSN, you
could no longer tell which pages are frozen and which are not. I'm sure it
could be made to work - and I got it working to some degree anyway - but
it's a bit scary. It's similar to the multixid changes in 9.3: multixids
also used to be data that you can just zap at restart, and when we changed
the rules so that you lose data if you lose multixids, we got trouble. Now,
LSNs are much simpler, and there wouldn't be anything like the
multioffset/member SLRUs that you'd have to keep around forever or vacuum,
but still..
LSNs are already pretty critical. If they're in the future, you can't
flush those pages. Ever. And if they're wrong in either direction,
crash recovery is broken. But it's still worth thinking about ways
that we could make this more robust.
I keep coming back to the idea of treating any page that is marked as
all-visible as frozen, and deferring freezing until the page is again
modified. The big downside of this is that if the page is set as
all-visible and then immediately thereafter modified, it sucks to have
to freeze when the XIDs in the page are still present in CLOG. But if
we could determine from the LSN that the XIDs in the page are new
enough to still be considered valid, then we could skip freezing in
those cases and only do it when the page is "old". That way, if
somebody zeroed out the LSN (why, oh why?) the worst that would happen
is that we'd do some extra freezing when the page was next modified.
In your idea, if we have a WORM (write-once read-many) table, then the
tuples on its pages would not be frozen at all unless we do VACUUM FREEZE.
Also in this situation, from the second time onward VACUUM FREEZE would
need to scan only the pages added since the last freezing, so we could
reduce I/O, but we would still need to do explicit freezing for
anti-wraparound as in the past. A WORM table has huge data in general, and
that data would increase rapidly, so it would also be expensive.
I would feel safer if we added a completely new "epoch" counter to the page
header, instead of reusing LSNs. But as we all know, changing the page
format is a problem for in-place upgrade, and takes some space too.
Yeah. We have a serious need to reduce the size of our on-disk
format. On a TPC-C-like workload Jan Wieck recently tested, our data
set was 34% larger than another database at the beginning of the test,
and 80% larger by the end of the test. And we did twice the disk
writes. See "The Elephants in the Room.pdf" at
https://sites.google.com/site/robertmhaas/presentations
Regards,
-------
Sawada Masahiko
On 04/22/2015 09:24 PM, Robert Haas wrote:
I would feel safer if we added a completely new "epoch" counter to the page
header, instead of reusing LSNs. But as we all know, changing the page
format is a problem for in-place upgrade, and takes some space too.
Yeah. We have a serious need to reduce the size of our on-disk
format. On a TPC-C-like workload Jan Wieck recently tested, our data
set was 34% larger than another database at the beginning of the test,
and 80% larger by the end of the test. And we did twice the disk
writes. See "The Elephants in the Room.pdf" at
https://sites.google.com/site/robertmhaas/presentations
Meh. Adding an 8-byte header to every 8k block would add 0.1% to the
disk size. No doubt it would be nice to reduce our disk footprint, but
the page header is not the elephant in the room.
- Heikki
On 21 April 2015 at 22:21, Robert Haas <robertmhaas@gmail.com> wrote:
I'm not saying those ideas don't have problems, because they do. But
I think they are worth further exploring. The main reason I gave up
on that is because Heikki was working on the XID-to-LSN mapping stuff.
That seemed like a better approach than either of the above, so as
long as Heikki was working on that, there wasn't much reason to pursue
more lowbrow approaches. Clearly, though, we need to do something
about this. Freezing is a big problem for lots of users.
All that having been said, I don't think adding a new fork is a good
approach. We already have problems pretty commonly where our
customers complain about running out of inodes. Adding another fork
for every table would exacerbate that problem considerably.
We were talking about having an incremental backup map also. Which sounds a
lot like the freeze map.
XID-to-LSN sounded cool but was complex. If we need the map for backup
purposes, we may as well do it the simple way and hit both birds at once.
We only need a freeze/backup map for larger relations. So if we map 1000
blocks per map page, we skip having a map at all when size < 1000.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Apr 23, 2015 at 4:19 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
We were talking about having an incremental backup map also. Which sounds a
lot like the freeze map.
Yeah, possibly. I think we should try to set things up so that the
backup map can be updated asynchronously by a background worker, so
that we're not adding more work to the foreground path just for the
benefit of maintenance operations. That might make the logic for
autovacuum to use it a little bit more complex, but it seems
manageable.
We only need a freeze/backup map for larger relations. So if we map 1000
blocks per map page, we skip having a map at all when size < 1000.
Agreed. We might also want to map multiple blocks per map slot - e.g.
one slot per 32 blocks. That would keep the map quite small even for
very large relations, and would not compromise efficiency that much
since reading 256kB sequentially probably takes only a little longer
than reading 8kB.
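For illustration, a tiny self-contained sketch of the block-to-slot arithmetic, assuming 32 blocks per slot and 8kB blocks (the figures from this mail; the names are invented):

#include <stdio.h>
#include <stdint.h>

#define TOY_BLOCKS_PER_SLOT 32
#define TOY_BLOCK_SIZE      8192

static uint32_t toy_slot_for_block(uint32_t block)
{
	return block / TOY_BLOCKS_PER_SLOT;
}

int main(void)
{
	uint32_t block = 100000;
	printf("block %u -> slot %u; one slot covers %u kB of heap\n",
		   block, toy_slot_for_block(block),
		   TOY_BLOCKS_PER_SLOT * TOY_BLOCK_SIZE / 1024);	/* 256 kB */
	return 0;
}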
I think the idea of integrating the freeze map into the VM fork is
also worth considering. Then, the incremental backup map could be
optional; if you don't want incremental backup, you can shut it off
and have less overhead.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Apr 22, 2015 at 8:55 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Apr 22, 2015 at 06:36:23PM -0500, Jim Nasby wrote:
On 4/22/15 6:12 PM, Bruce Momjian wrote:
My point is that for the life of 200M transactions, you would have the
overhead of an additional file per table in the file system, and updates
of that. I just don't know if the overhead over the long time period
would be smaller than the VACUUM FREEZE. It might be fine --- I don't
know. People seem to focus on the big activities, while many small
activities can lead to larger slowdowns.
Ahh. This wouldn't be for the life of 200M transactions; it would be
a permanent fork, just like the VM is.
Right. My point is that either you do X 2M times to maintain that fork
and the overhead of the file existence, or you do one VACUUM FREEZE. I
am saying that 2M is a large number and adding all those X's might
exceed the cost of a VACUUM FREEZE.
I agree, but if we instead make this part of the visibility map
instead of a separate fork, the cost is much less. It won't be any
more expensive to clear 2 consecutive bits any time a page is touched
than it is to clear 1. The VM fork will be twice as large, but still
tiny. And the fact that you'll have only half as many pages mapping
to the same VM page may even improve performance in some cases by
reducing contention. Even when it reduces performance, I think the
impact will be so tiny as not to be worth caring about.
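To make the "clear 2 consecutive bits" point concrete, here is a toy sketch of a two-bits-per-heap-page map; the names are invented and need not match whatever the real visibility map code would use:

#include <stdint.h>

#define TOY_BITS_PER_HEAPBLOCK	2
#define TOY_ALL_VISIBLE			0x01
#define TOY_ALL_FROZEN			0x02
#define TOY_HEAPBLOCKS_PER_BYTE	(8 / TOY_BITS_PER_HEAPBLOCK)	/* 4 */

static uint32_t toy_vm_byte(uint32_t heap_block)
{
	return heap_block / TOY_HEAPBLOCKS_PER_BYTE;
}

static unsigned toy_vm_shift(uint32_t heap_block)
{
	return (heap_block % TOY_HEAPBLOCKS_PER_BYTE) * TOY_BITS_PER_HEAPBLOCK;
}

/* Clearing both flags for a page is one read-modify-write of one map byte. */
static void toy_vm_clear(uint8_t *map, uint32_t heap_block)
{
	map[toy_vm_byte(heap_block)] &=
		(uint8_t) ~((TOY_ALL_VISIBLE | TOY_ALL_FROZEN) << toy_vm_shift(heap_block));
}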
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 4/23/15 2:42 AM, Heikki Linnakangas wrote:
On 04/22/2015 09:24 PM, Robert Haas wrote:
Yeah. We have a serious need to reduce the size of our on-disk
format. On a TPC-C-like workload Jan Wieck recently tested, our data
set was 34% larger than another database at the beginning of the test,
and 80% larger by the end of the test. And we did twice the disk
writes. See "The Elephants in the Room.pdf" at
https://sites.google.com/site/robertmhaas/presentations
Meh. Adding an 8-byte header to every 8k block would add 0.1% to the
disk size. No doubt it would be nice to reduce our disk footprint, but
the page header is not the elephant in the room.
I've often wondered if there was some way we could consolidate XMIN/XMAX
from multiple tuples at the page level; that could be a big win for OLAP
environments where most of your tuples belong to a pretty small range of
XIDs. In many workloads you could have 80%+ of the tuples in a table
having a single inserting XID.
Dunno how much it would help for OLTP though... :/
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 4/23/15 8:42 AM, Robert Haas wrote:
On Thu, Apr 23, 2015 at 4:19 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
We were talking about having an incremental backup map also. Which sounds a
lot like the freeze map.
Yeah, possibly. I think we should try to set things up so that the
backup map can be updated asynchronously by a background worker, so
that we're not adding more work to the foreground path just for the
benefit of maintenance operations. That might make the logic for
autovacuum to use it a little bit more complex, but it seems
manageable.
I'm not sure an actual map makes sense... for incremental backups you
need some kind of stream that tells you not only what changed but when
it changed. A simple freeze map won't work for that because the
operation of freezing itself writes data (and the same can be true for
VM). Though, if the backup utility was actually comparing live data to
an actual backup maybe this would work...
We only need a freeze/backup map for larger relations. So if we map 1000
blocks per map page, we skip having a map at all when size < 1000.
Agreed. We might also want to map multiple blocks per map slot - e.g.
one slot per 32 blocks. That would keep the map quite small even for
very large relations, and would not compromise efficiency that much
since reading 256kB sequentially probably takes only a little longer
than reading 8kB.
The problem with mapping a range of pages per bit is dealing with
locking when you set the bit. Currently that's easy because we're
holding the cleanup lock on the page, but you can't do that if you have
a range of pages. Though, if each 'slot' wasn't a simple binary value we
could have a 3rd state that indicates we're in the process of marking
that slot as all visible/frozen, but you still need to consider the bit
as cleared.
Honestly though, I think concerns about the size of the map are a bit
overblown. Even if we double its size, it's still 32,000 times smaller
than the heap is with 8k pages. I suspect if you have tables large
enough where you'll care, you'll also be using 32k pages, which means
it'd be 128,000 times smaller than the heap. I have a hard time
believing that's going to be even a faint blip on the performance radar.
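A quick self-contained check of those ratios, assuming 2 map bits per heap page (a toy calculation, not taken from the actual VM code):

#include <stdio.h>

/* Heap bytes covered per map byte when each heap page uses 2 map bits. */
static double toy_map_ratio(double page_bytes)
{
	return page_bytes * 8.0 / 2.0;
}

int main(void)
{
	double heap_bytes = 64.0 * 1024 * 1024 * 1024;	/* the 64GB table mentioned earlier */

	printf("8kB pages:  1 : %.0f, map for 64GB heap = %.1f MB\n",
		   toy_map_ratio(8192), heap_bytes / toy_map_ratio(8192) / (1024 * 1024));
	printf("32kB pages: 1 : %.0f\n", toy_map_ratio(32768));
	return 0;
}

That prints roughly 1:32768 (2MB for 64GB) with 8kB pages and 1:131072 with 32kB pages, matching the figures above.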
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 04/23/2015 05:52 PM, Jim Nasby wrote:
On 4/23/15 2:42 AM, Heikki Linnakangas wrote:
On 04/22/2015 09:24 PM, Robert Haas wrote:
Yeah. We have a serious need to reduce the size of our on-disk
format. On a TPC-C-like workload Jan Wieck recently tested, our data
set was 34% larger than another database at the beginning of the test,
and 80% larger by the end of the test. And we did twice the disk
writes. See "The Elephants in the Room.pdf" at
https://sites.google.com/site/robertmhaas/presentations
Meh. Adding an 8-byte header to every 8k block would add 0.1% to the
disk size. No doubt it would be nice to reduce our disk footprint, but
the page header is not the elephant in the room.
I've often wondered if there was some way we could consolidate XMIN/XMAX
from multiple tuples at the page level; that could be a big win for OLAP
environments where most of your tuples belong to a pretty small range of
XIDs. In many workloads you could have 80%+ of the tuples in a table
having a single inserting XID.
It would be doable for xmin - IIRC someone even posted a patch for that
years ago - but xmax (and ctid) is difficult. When a tuple is inserted,
Xmax is basically just a reservation for the value that will be put
there later. You have no idea what that value is, and you can't
influence it, and when it's time to delete/update the row, you *must*
have the space for that xmax. So we can't opportunistically use the
space for anything else, or compress them or anything like that.
- Heikki
On Thu, Apr 23, 2015 at 10:42:59AM +0300, Heikki Linnakangas wrote:
On 04/22/2015 09:24 PM, Robert Haas wrote:
I would feel safer if we added a completely new "epoch" counter to the page
header, instead of reusing LSNs. But as we all know, changing the page
format is a problem for in-place upgrade, and takes some space too.
Yeah. We have a serious need to reduce the size of our on-disk
format. On a TPC-C-like workload Jan Wieck recently tested, our data
set was 34% larger than another database at the beginning of the test,
and 80% larger by the end of the test. And we did twice the disk
writes. See "The Elephants in the Room.pdf" at
https://sites.google.com/site/robertmhaas/presentations
Meh. Adding an 8-byte header to every 8k block would add 0.1% to the
disk size. No doubt it would be nice to reduce our disk footprint,
but the page header is not the elephant in the room.
Agreed. Are you saying we can't find a way to fit an 8-byte value into
the existing page in a backward-compatible way?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 23/04/15 17:24, Heikki Linnakangas wrote:
On 04/23/2015 05:52 PM, Jim Nasby wrote:
On 4/23/15 2:42 AM, Heikki Linnakangas wrote:
On 04/22/2015 09:24 PM, Robert Haas wrote:
Yeah. We have a serious need to reduce the size of our on-disk
format. On a TPC-C-like workload Jan Wieck recently tested, our data
set was 34% larger than another database at the beginning of the test,
and 80% larger by the end of the test. And we did twice the disk
writes. See "The Elephants in the Room.pdf" at
https://sites.google.com/site/robertmhaas/presentations
Meh. Adding an 8-byte header to every 8k block would add 0.1% to the
disk size. No doubt it would be nice to reduce our disk footprint, but
the page header is not the elephant in the room.
I've often wondered if there was some way we could consolidate XMIN/XMAX
from multiple tuples at the page level; that could be a big win for OLAP
environments where most of your tuples belong to a pretty small range of
XIDs. In many workloads you could have 80%+ of the tuples in a table
having a single inserting XID.
It would be doable for xmin - IIRC someone even posted a patch for that
years ago - but xmax (and ctid) is difficult. When a tuple is inserted,
Xmax is basically just a reservation for the value that will be put
there later. You have no idea what that value is, and you can't
influence it, and when it's time to delete/update the row, you *must*
have the space for that xmax. So we can't opportunistically use the
space for anything else, or compress them or anything like that.
That depends; if we are going to change the page format we can move the xmax
to be some map of ctid->xmax in the header (with no values for tuples
with no xmax) or have a bitmap there of tuples that have xmax, etc.
Basically not saving xmax (and potentially other info) inline for each
tuple but have some info in header only for tuples that need it. That
might have bad performance side effects of course, but there are
definitely some potential ways of doing things differently which we
could explore.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Apr 23, 2015 at 06:24:00PM +0300, Heikki Linnakangas wrote:
I've often wondered if there was some way we could consolidate XMIN/XMAX
from multiple tuples at the page level; that could be a big win for OLAP
environments where most of your tuples belong to a pretty small range of
XIDs. In many workloads you could have 80%+ of the tuples in a table
having a single inserting XID.
It would be doable for xmin - IIRC someone even posted a patch for
that years ago - but xmax (and ctid) is difficult. When a tuple is
inserted, Xmax is basically just a reservation for the value that
will be put there later. You have no idea what that value is, and
you can't influence it, and when it's time to delete/update the row,
you *must* have the space for that xmax. So we can't
opportunistically use the space for anything else, or compress them
or anything like that.
Also SELECT FOR UPDATE uses the per-row xmax too.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote:
Right. My point is that either you do X 2M times to maintain that fork
and the overhead of the file existence, or you do one VACUUM FREEZE. I
am saying that 2M is a large number and adding all those X's might
exceed the cost of a VACUUM FREEZE.
I agree, but if we instead make this part of the visibility map
instead of a separate fork, the cost is much less. It won't be any
more expensive to clear 2 consecutive bits any time a page is touched
than it is to clear 1. The VM fork will be twice as large, but still
tiny. And the fact that you'll have only half as many pages mapping
to the same VM page may even improve performance in some cases by
reducing contention. Even when it reduces performance, I think the
impact will be so tiny as not to be worth caring about.
Agreed, no extra file, and the same write volume as currently. It would
also match pg_clog, which uses two bits per transaction --- maybe we can
reuse some of that code.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 04/23/2015 06:39 PM, Petr Jelinek wrote:
On 23/04/15 17:24, Heikki Linnakangas wrote:
On 04/23/2015 05:52 PM, Jim Nasby wrote:
I've often wondered if there was some way we could consolidate XMIN/XMAX
from multiple tuples at the page level; that could be a big win for OLAP
environments where most of your tuples belong to a pretty small range of
XIDs. In many workloads you could have 80%+ of the tuples in a table
having a single inserting XID.
It would be doable for xmin - IIRC someone even posted a patch for that
years ago - but xmax (and ctid) is difficult. When a tuple is inserted,
Xmax is basically just a reservation for the value that will be put
there later. You have no idea what that value is, and you can't
influence it, and when it's time to delete/update the row, you *must*
have the space for that xmax. So we can't opportunistically use the
space for anything else, or compress them or anything like that.
That depends, if we are going to change page format we can move the xmax
to be some map of ctid->xmax in the header (with no values for tuples
with no xmax) ...
Stop right there. You need to reserve enough space on the page to store
an xmax for *every* tuple on the page. Because if you don't, what are
you going to do when every tuple on the page is deleted by a different
transaction?
Even if you store the xmax somewhere else than the page header, you need
to reserve the same amount of space for them, so it doesn't help at all.
- Heikki
On 04/23/2015 06:38 PM, Bruce Momjian wrote:
On Thu, Apr 23, 2015 at 10:42:59AM +0300, Heikki Linnakangas wrote:
On 04/22/2015 09:24 PM, Robert Haas wrote:
I would feel safer if we added a completely new "epoch" counter to the page
header, instead of reusing LSNs. But as we all know, changing the page
format is a problem for in-place upgrade, and takes some space too.
Yeah. We have a serious need to reduce the size of our on-disk
format. On a TPC-C-like workload Jan Wieck recently tested, our data
set was 34% larger than another database at the beginning of the test,
and 80% larger by the end of the test. And we did twice the disk
writes. See "The Elephants in the Room.pdf" at
https://sites.google.com/site/robertmhaas/presentations
Meh. Adding an 8-byte header to every 8k block would add 0.1% to the
disk size. No doubt it would be nice to reduce our disk footprint,
but the page header is not the elephant in the room.
Agreed. Are you saying we can't find a way to fit an 8-byte value into
the existing page in a backward-compatible way?
I'm sure we can find a way. We've discussed ways to handle page format
updates in pg_upgrade before, and I don't want to get into that
discussion here, but it's not trivial.
- Heikki
On Thu, Apr 23, 2015 at 06:52:20PM +0300, Heikki Linnakangas wrote:
Agreed. Are you saying we can't find a way to fit an 8-byte value into
the existing page in a backward-compatible way?
I'm sure we can find a way. We've discussed ways to handle page
format updates in pg_upgrade before, and I don't want to get into
that discussion here, but it's not trivial.
OK, good to know, thanks.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 23/04/15 17:45, Bruce Momjian wrote:
On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote:
Right. My point is that either you do X 2M times to maintain that fork
and the overhead of the file existence, or you do one VACUUM FREEZE. I
am saying that 2M is a large number and adding all those X's might
exceed the cost of a VACUUM FREEZE.
I agree, but if we instead make this part of the visibility map
instead of a separate fork, the cost is much less. It won't be any
more expensive to clear 2 consecutive bits any time a page is touched
than it is to clear 1. The VM fork will be twice as large, but still
tiny. And the fact that you'll have only half as many pages mapping
to the same VM page may even improve performance in some cases by
reducing contention. Even when it reduces performance, I think the
impact will be so tiny as not to be worth caring about.
Agreed, no extra file, and the same write volume as currently. It would
also match pg_clog, which uses two bits per transaction --- maybe we can
reuse some of that code.
Yeah, this approach seems promising. We probably can't reuse code from
clog because the usage pattern is different (key for clog is xid, while
for visibility/freeze map ctid is used). But visibility map storage
layer is pretty simple so it should be easy to extend it for this use.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 4/23/15 11:06 AM, Petr Jelinek wrote:
On 23/04/15 17:45, Bruce Momjian wrote:
On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote:
Agreed, no extra file, and the same write volume as currently. It would
also match pg_clog, which uses two bits per transaction --- maybe we can
reuse some of that code.
Yeah, this approach seems promising. We probably can't reuse code from
clog because the usage pattern is different (key for clog is xid, while
for visibility/freeze map ctid is used). But visibility map storage
layer is pretty simple so it should be easy to extend it for this use.
Actually, there may be some bit manipulation functions we could reuse;
things like efficiently counting how many things in a byte are set.
Probably doesn't make sense to fully refactor it, but at least CLOG is a
good source for cut/paste/whack.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Thu, Apr 23, 2015 at 10:42 PM, Robert Haas wrote:
On Thu, Apr 23, 2015 at 4:19 AM, Simon Riggs wrote:
We only need a freeze/backup map for larger relations. So if we map 1000
blocks per map page, we skip having a map at all when size < 1000.
Agreed. We might also want to map multiple blocks per map slot - e.g.
one slot per 32 blocks. That would keep the map quite small even for
very large relations, and would not compromise efficiency that much
since reading 256kB sequentially probably takes only a little longer
than reading 8kB.
I think the idea of integrating the freeze map into the VM fork is
also worth considering. Then, the incremental backup map could be
optional; if you don't want incremental backup, you can shut it off
and have less overhead.
When I read that I think about something configurable at relation level.
There are cases where you may want to have more granularity of this
information at block level by having the VM slots track fewer blocks
than 32, and vice versa.
--
Michael
On Fri, Apr 24, 2015 at 1:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 4/23/15 11:06 AM, Petr Jelinek wrote:
On 23/04/15 17:45, Bruce Momjian wrote:
On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote:
Agreed, no extra file, and the same write volume as currently. It would
also match pg_clog, which uses two bits per transaction --- maybe we can
reuse some of that code.
Yeah, this approach seems promising. We probably can't reuse code from
clog because the usage pattern is different (key for clog is xid, while
for visibility/freeze map ctid is used). But visibility map storage
layer is pretty simple so it should be easy to extend it for this use.
Actually, there may be some bit manipulation functions we could reuse;
things like efficiently counting how many things in a byte are set. Probably
doesn't make sense to fully refactor it, but at least CLOG is a good source
for cut/paste/whack.
I agree with adding a bit into the VM that indicates the corresponding
page is all-frozen, just like CLOG.
I'll change the patch along these lines for a second version.
Regards,
-------
Sawada Masahiko
On Thu, Apr 23, 2015 at 9:03 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Thu, Apr 23, 2015 at 10:42 PM, Robert Haas wrote:
On Thu, Apr 23, 2015 at 4:19 AM, Simon Riggs wrote:
We only need a freeze/backup map for larger relations. So if we map 1000
blocks per map page, we skip having a map at all when size < 1000.
Agreed. We might also want to map multiple blocks per map slot - e.g.
one slot per 32 blocks. That would keep the map quite small even for
very large relations, and would not compromise efficiency that much
since reading 256kB sequentially probably takes only a little longer
than reading 8kB.
I think the idea of integrating the freeze map into the VM fork is
also worth considering. Then, the incremental backup map could be
optional; if you don't want incremental backup, you can shut it off
and have less overhead.
When I read that I think about something configurable at
relation-level. There are cases where you may want to have more
granularity of this information at block level by having the VM slots
to track less blocks than 32, and vice-versa.
What are those cases? To me that sounds like making things
complicated to no obvious benefit.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 4/24/15 6:52 AM, Robert Haas wrote:
On Thu, Apr 23, 2015 at 9:03 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Thu, Apr 23, 2015 at 10:42 PM, Robert Haas wrote:
On Thu, Apr 23, 2015 at 4:19 AM, Simon Riggs wrote:
We only need a freeze/backup map for larger relations. So if we map 1000
blocks per map page, we skip having a map at all when size < 1000.
Agreed. We might also want to map multiple blocks per map slot - e.g.
one slot per 32 blocks. That would keep the map quite small even for
very large relations, and would not compromise efficiency that much
since reading 256kB sequentially probably takes only a little longer
than reading 8kB.
I think the idea of integrating the freeze map into the VM fork is
also worth considering. Then, the incremental backup map could be
optional; if you don't want incremental backup, you can shut it off
and have less overhead.
When I read that I think about something configurable at
relation-level. There are cases where you may want to have more
granularity of this information at block level by having the VM slots
to track less blocks than 32, and vice-versa.
What are those cases? To me that sounds like making things
complicated to no obvious benefit.
Tables that get few/no dead tuples, like bulk insert tables. You'll have
large sections of blocks with the same visibility.
I suspect the added code to allow setting 1 bit for multiple pages
without having to lock all those pages simultaneously will probably
outweigh making this a reloption anyway.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Fri, Apr 24, 2015 at 4:09 PM, Jim Nasby <Jim.Nasby@bluetreble.com>
wrote:
When I read that I think about something configurable at
relation-level. There are cases where you may want to have more
granularity of this information at block level by having the VM slots
to track less blocks than 32, and vice-versa.
What are those cases? To me that sounds like making things
complicated to no obvious benefit.
Tables that get few/no dead tuples, like bulk insert tables. You'll have
large sections of blocks with the same visibility.
I don't see any reason why that would require different granularity.
I suspect the added code to allow setting 1 bit for multiple pages without
having to lock all those pages simultaneously will probably outweigh making
this a reloption anyway.
That's a completely unrelated issue.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 4/28/15 7:11 AM, Robert Haas wrote:
On Fri, Apr 24, 2015 at 4:09 PM, Jim Nasby <Jim.Nasby@bluetreble.com>
wrote:
When I read that I think about something configurable at relation-level.
There are cases where you may want to have more
granularity of this information at block level by having the VM slots
to track less blocks than 32, and vice-versa.
What are those cases? To me that sounds like making things
complicated to no obvious benefit.
Tables that get few/no dead tuples, like bulk insert tables. You'll have
large sections of blocks with the same visibility.
I don't see any reason why that would require different granularity.
Because in those cases it would be trivial to drop XMIN out of the tuple
headers. For a warehouse with narrow rows that could be a significant
win. Moreso, we could also move XMAX to the page level if we accept that
if we need to invalidate any tuple we'd have to move all of them. In a
warehouse situation that's probably OK as well.
That said, I don't think this is the first place to focus for reducing
our on-disk format; reducing cleanup bloat would probably be a lot more
useful.
Did you or Jan have more detailed info from the test he ran about where
our 80% overhead was ending up? That would remove a lot of speculation
here...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Apr 28, 2015 at 1:53 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
Because in those cases it would be trivial to drop XMIN out of the tuple
headers. For a warehouse with narrow rows that could be a significant win.
Moreso, we could also move XMAX to the page level if we accept that if we
need to invalidate any tuple we'd have to move all of them. In a warehouse
situation that's probably OK as well.
You have a funny definition of "trivial". If you start looking
through the code you'll see that anything that changes the format of
the tuple header is a very large undertaking. And the bit about "if
we invalidate any tuple we'd need to move all of them" doesn't really
make any sense; we have no infrastructure that would allow us "move"
tuples like that. A lot of people would like it if we did, but we
don't.
That said, I don't think this is the first place to focus for reducing our
on-disk format; reducing cleanup bloat would probably be a lot more useful.
Sure; changing the on-disk format is a different project than tracking
the frozen parts of a table, which is what this thread started out
being about, and nothing you've said since then seems to add or
detract from that. I still think the best way to do it is to make the
VM carry two bits per page instead of one.
Did you or Jan have more detailed info from the test he ran about where our
80% overhead was ending up? That would remove a lot of speculation here...
We have more detailed information on that, but (1) that's not a very
specific question and (2) it has nothing to do with freeze avoidance,
so I'm not sure why you are asking on this thread. Let's try not to
get sidetracked from the well-defined proposal that just needs to be
implemented to speculation about major changes in completely unrelated
areas.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Apr 24, 2015 at 11:21 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Fri, Apr 24, 2015 at 1:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 4/23/15 11:06 AM, Petr Jelinek wrote:
On 23/04/15 17:45, Bruce Momjian wrote:
On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote:
Agreed, no extra file, and the same write volume as currently. It would
also match pg_clog, which uses two bits per transaction --- maybe we can
reuse some of that code.
Yeah, this approach seems promising. We probably can't reuse code from
clog because the usage pattern is different (key for clog is xid, while
for visibility/freeze map ctid is used). But visibility map storage
layer is pretty simple so it should be easy to extend it for this use.
Actually, there may be some bit manipulation functions we could reuse;
things like efficiently counting how many things in a byte are set. Probably
doesn't make sense to fully refactor it, but at least CLOG is a good source
for cut/paste/whack.
I agree with adding a bit to the VM that indicates the corresponding page is
all-frozen, just like CLOG.
I'll change the patch accordingly and post it as a second version.
The second patch is attached.
In the second patch, I added a bit to the visibility map that indicates all
tuples in a page are completely frozen.
The visibility map is now a bitmap with two bits per heap page:
all-visible and all-frozen.
The logic around vacuum and heap insert/update/delete is almost the same as
in the previous version.
The patch still lacks some things (documentation, comments in the source code,
etc.), so it is still a WIP patch,
but I think it is enough to start discussing it.
Feedback is welcome.
Regards,
-------
Sawada Masahiko
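Before the attached diff, a toy model (plain C, not PostgreSQL code; need_to_scan is a made-up helper) of the page-skipping behaviour the new bit enables: an all-frozen page can be skipped even by an anti-wraparound (scan_all) vacuum, while an all-visible but not all-frozen page still has to be visited in that case:

#include <stdbool.h>
#include <stdio.h>

/* decide whether vacuum must visit a heap page, given its two map bits */
static bool
need_to_scan(bool all_visible, bool all_frozen, bool scan_all)
{
    if (all_frozen)
        return false;               /* nothing on the page can need freezing */
    if (all_visible && !scan_all)
        return false;               /* ordinary vacuum skips all-visible pages */
    return true;
}

int
main(void)
{
    printf("all-frozen page, anti-wraparound vacuum:  %d\n",
           need_to_scan(true, true, true));     /* 0: still skipped */
    printf("all-visible page, anti-wraparound vacuum: %d\n",
           need_to_scan(true, false, true));    /* 1: must be scanned */
    return 0;
}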
Attachments:
000_add_frozen_bit_into_visibilitymap_v1.patchtext/x-diff; charset=US-ASCII; name=000_add_frozen_bit_into_visibilitymap_v1.patchDownload
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b504ccd..a06e16d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -86,7 +86,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool all_frozen_cleared, bool new_all_frozen_cleared);
static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
Bitmapset *hot_attrs,
Bitmapset *key_attrs, Bitmapset *id_attrs,
@@ -2068,7 +2069,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
HeapTuple heaptup;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
- bool all_visible_cleared = false;
+ bool all_visible_cleared = false,
+ all_frozen_cleared = false;
/*
* Fill in tuple header fields, assign an OID, and toast the tuple if
@@ -2092,8 +2094,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -2110,7 +2113,16 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
PageClearAllVisible(BufferGetPage(buffer));
visibilitymap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
+ }
+
+ if (PageIsAllFrozen(BufferGetPage(buffer)))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(BufferGetPage(buffer));
+ visibilitymap_clear(relation,
+ ItemPointerGetBlockNumber(&(heaptup->t_self)),
+ vmbuffer, VISIBILITYMAP_ALL_FROZEN);
}
/*
@@ -2157,6 +2169,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
xlrec.flags = all_visible_cleared ? XLOG_HEAP_ALL_VISIBLE_CLEARED : 0;
+ if (all_frozen_cleared)
+ xlrec.flags |= XLOG_HEAP_ALL_FROZEN_CLEARED;
Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
/*
@@ -2350,7 +2364,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
{
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
- bool all_visible_cleared = false;
+ bool all_visible_cleared = false,
+ all_frozen_cleared = false;
int nthispage;
CHECK_FOR_INTERRUPTS();
@@ -2395,7 +2410,16 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
PageClearAllVisible(page);
visibilitymap_clear(relation,
BufferGetBlockNumber(buffer),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
+ }
+
+ if (PageIsAllFrozen(page))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(page);
+ visibilitymap_clear(relation,
+ BufferGetBlockNumber(buffer),
+ vmbuffer, VISIBILITYMAP_ALL_FROZEN);
}
/*
@@ -2440,6 +2464,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
tupledata = scratchptr;
xlrec->flags = all_visible_cleared ? XLOG_HEAP_ALL_VISIBLE_CLEARED : 0;
+ if (all_frozen_cleared)
+ xlrec->flags |= XLOG_HEAP_ALL_FROZEN_CLEARED;
xlrec->ntuples = nthispage;
/*
@@ -2642,7 +2668,8 @@ heap_delete(Relation relation, ItemPointer tid,
new_infomask2;
bool have_tuple_lock = false;
bool iscombo;
- bool all_visible_cleared = false;
+ bool all_visible_cleared = false,
+ all_frozen_cleared = false;
HeapTuple old_key_tuple = NULL; /* replica identity of the tuple */
bool old_key_copied = false;
@@ -2658,18 +2685,19 @@ heap_delete(Relation relation, ItemPointer tid,
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
- if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+ if (vmbuffer == InvalidBuffer &&
+ (PageIsAllVisible(page) || PageIsAllFrozen(page)))
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
visibilitymap_pin(relation, block, &vmbuffer);
@@ -2859,12 +2887,22 @@ l1:
*/
PageSetPrunable(page, xid);
+ /* clear PD_ALL_VISIBLE flags */
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
PageClearAllVisible(page);
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
+ }
+
+ /* clear PD_ALL_FROZEN flags */
+ if (PageIsAllFrozen(page))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(page);
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+ vmbuffer, VISIBILITYMAP_ALL_FROZEN);
}
/* store transaction information of xact deleting the tuple */
@@ -2891,6 +2929,8 @@ l1:
log_heap_new_cid(relation, &tp);
xlrec.flags = all_visible_cleared ? XLOG_HEAP_ALL_VISIBLE_CLEARED : 0;
+ if (all_frozen_cleared)
+ xlrec.flags |= XLOG_HEAP_ALL_FROZEN_CLEARED;
xlrec.infobits_set = compute_infobits(tp.t_data->t_infomask,
tp.t_data->t_infomask2);
xlrec.offnum = ItemPointerGetOffsetNumber(&tp.t_self);
@@ -3088,6 +3128,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool key_intact;
bool all_visible_cleared = false;
bool all_visible_cleared_new = false;
+ bool all_frozen_cleared = false;
+ bool all_frozen_cleared_new = false;
bool checked_lockers;
bool locker_remains;
TransactionId xmax_new_tuple,
@@ -3121,12 +3163,12 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
page = BufferGetPage(buffer);
/*
- * Before locking the buffer, pin the visibility map page if it appears to
- * be necessary. Since we haven't got the lock yet, someone else might be
+ * Before locking the buffer, pin the visibility map if it appears to be
+ * necessary. Since we haven't got the lock yet, someone else might be
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -3409,21 +3451,23 @@ l2:
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
+
bms_free(hot_attrs);
bms_free(key_attrs);
return result;
}
/*
- * If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, or during some
- * subsequent window during which we had it unlocked, we'll have to unlock
- * and re-lock, to avoid holding the buffer lock across an I/O. That's a
- * bit unfortunate, especially since we'll now have to recheck whether the
- * tuple has been locked or updated under us, but hopefully it won't
- * happen very often.
+ * If we didn't pin the visibility map page and the page has
+ * become all visible(and frozen) while we were busy locking the buffer,
+ * or during some subsequent window during which we had it unlocked,
+ * we'll have to unlock and re-lock, to avoid holding the buffer lock
+ * across an I/O. That's a bit unfortunate, especially since we'll now
+ * have to recheck whether the tuple has been locked or updated under us,
+ * but hopefully it won't happen very often.
*/
- if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+ if (vmbuffer == InvalidBuffer &&
+ (PageIsAllVisible(page) || PageIsAllFrozen(page)))
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
visibilitymap_pin(relation, block, &vmbuffer);
@@ -3722,14 +3766,30 @@ l2:
all_visible_cleared = true;
PageClearAllVisible(BufferGetPage(buffer));
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
}
if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
{
all_visible_cleared_new = true;
PageClearAllVisible(BufferGetPage(newbuf));
visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
- vmbuffer_new);
+ vmbuffer_new, VISIBILITYMAP_ALL_VISIBLE);
+ }
+
+ /* clear PD_ALL_FROZEN flags */
+ if (PageIsAllFrozen(BufferGetPage(buffer)))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(BufferGetPage(buffer));
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+ vmbuffer, VISIBILITYMAP_ALL_FROZEN);
+ }
+ if (newbuf != buffer && PageIsAllFrozen(BufferGetPage(newbuf)))
+ {
+ all_frozen_cleared_new = true;
+ PageClearAllFrozen(BufferGetPage(newbuf));
+ visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
+ vmbuffer_new, VISIBILITYMAP_ALL_FROZEN);
}
if (newbuf != buffer)
@@ -3755,7 +3815,9 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ all_frozen_cleared,
+ all_frozen_cleared_new);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -6564,7 +6626,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -6574,6 +6636,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -6597,7 +6660,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool all_frozen_cleared, bool new_all_frozen_cleared)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -6680,6 +6744,10 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.flags |= XLOG_HEAP_ALL_VISIBLE_CLEARED;
if (new_all_visible_cleared)
xlrec.flags |= XLOG_HEAP_NEW_ALL_VISIBLE_CLEARED;
+ if (all_frozen_cleared)
+ xlrec.flags |= XLOG_HEAP_ALL_FROZEN_CLEARED;
+ if (new_all_frozen_cleared)
+ xlrec.flags |= XLOG_HEAP_NEW_ALL_FROZEN_CLEARED;
if (prefixlen > 0)
xlrec.flags |= XLOG_HEAP_PREFIX_FROM_OLD;
if (suffixlen > 0)
@@ -7162,8 +7230,14 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
+
}
else if (action == BLK_RESTORED)
{
@@ -7214,7 +7288,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
@@ -7323,13 +7397,21 @@ heap_xlog_delete(XLogReaderState *record)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLOG_HEAP_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags &
+ (XLOG_HEAP_ALL_VISIBLE_CLEARED | XLOG_HEAP_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(target_node);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set flags to clear either one bit or both */
+ flags |= xlrec->flags & XLOG_HEAP_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLOG_HEAP_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -7398,13 +7480,20 @@ heap_xlog_insert(XLogReaderState *record)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLOG_HEAP_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLOG_HEAP_ALL_VISIBLE_CLEARED | XLOG_HEAP_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(target_node);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set flags to clear either one bit or both */
+ flags |= xlrec->flags & XLOG_HEAP_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLOG_HEAP_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -7464,6 +7553,9 @@ heap_xlog_insert(XLogReaderState *record)
if (xlrec->flags & XLOG_HEAP_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLOG_HEAP_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
+
MarkBufferDirty(buffer);
}
if (BufferIsValid(buffer))
@@ -7524,7 +7616,22 @@ heap_xlog_multi_insert(XLogReaderState *record)
Buffer vmbuffer = InvalidBuffer;
visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
+ ReleaseBuffer(vmbuffer);
+ FreeFakeRelcacheEntry(reln);
+ }
+
+ /*
+ * The frozen map may need to be fixed even if the heap page is
+ * already up-to-date.
+ */
+ if (xlrec->flags & XLOG_HEAP_ALL_FROZEN_CLEARED)
+ {
+ Relation reln = CreateFakeRelcacheEntry(rnode);
+ Buffer vmbuffer = InvalidBuffer;
+
+ visibilitymap_pin(reln, blkno, &vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_ALL_FROZEN);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -7602,6 +7709,8 @@ heap_xlog_multi_insert(XLogReaderState *record)
if (xlrec->flags & XLOG_HEAP_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLOG_HEAP_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
MarkBufferDirty(buffer);
}
@@ -7673,13 +7782,20 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLOG_HEAP_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLOG_HEAP_ALL_VISIBLE_CLEARED | XLOG_HEAP_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(rnode);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set flags to clear either one bit or both */
+ flags |= xlrec->flags & XLOG_HEAP_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLOG_HEAP_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, oldblk, &vmbuffer);
- visibilitymap_clear(reln, oldblk, vmbuffer);
+ visibilitymap_clear(reln, oldblk, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -7730,6 +7846,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
if (xlrec->flags & XLOG_HEAP_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLOG_HEAP_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
PageSetLSN(page, lsn);
MarkBufferDirty(obuffer);
@@ -7757,13 +7875,21 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLOG_HEAP_NEW_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags &
+ (XLOG_HEAP_NEW_ALL_VISIBLE_CLEARED | XLOG_HEAP_NEW_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(rnode);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set flags to clear either one bit or both */
+ flags |= xlrec->flags & XLOG_HEAP_NEW_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLOG_HEAP_NEW_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, newblk, &vmbuffer);
- visibilitymap_clear(reln, newblk, vmbuffer);
+ visibilitymap_clear(reln, newblk, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -7865,6 +7991,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
if (xlrec->flags & XLOG_HEAP_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLOG_HEAP_NEW_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6d091f6..ceab7d8 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -314,7 +314,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
{
/* easy case */
buffer = ReadBufferBI(relation, targetBlock, bistate);
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(BufferGetPage(buffer)) ||
+ PageIsAllFrozen(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
@@ -322,7 +323,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
{
/* also easy case */
buffer = otherBuffer;
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(BufferGetPage(buffer)) ||
+ PageIsAllFrozen(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
@@ -330,7 +332,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
{
/* lock other buffer first */
buffer = ReadBuffer(relation, targetBlock);
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(BufferGetPage(buffer)) ||
+ PageIsAllFrozen(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -339,7 +342,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
{
/* lock target buffer first */
buffer = ReadBuffer(relation, targetBlock);
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(BufferGetPage(buffer)) ||
+ PageIsAllFrozen(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
@@ -367,13 +371,17 @@ RelationGetBufferForTuple(Relation relation, Size len,
* done.
*/
if (otherBuffer == InvalidBuffer || buffer <= otherBuffer)
+ {
GetVisibilityMapPins(relation, buffer, otherBuffer,
targetBlock, otherBlock, vmbuffer,
vmbuffer_other);
+ }
else
+ {
GetVisibilityMapPins(relation, otherBuffer, buffer,
otherBlock, targetBlock, vmbuffer_other,
vmbuffer);
+ }
/*
* Now we can check to see if there's enough free space here. If so,
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..ab8beef 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,11 +21,14 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, so the page doesn't need to be vacuumed even if whole
+ * table scanning vacuum is required. The map is conservative in the sense that
+ * we make sure that whenever a bit is set, we know the condition is true,
+ * but if a bit is not set, it might or might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
@@ -33,21 +36,25 @@
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or frozen
+ * to all transactions; we just don't know that for certain. The difficulty is
+ * that there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit. If a
+ * crash occurs after the visibility map page makes it to disk and before the
+ * updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
- * VACUUM will normally skip pages for which the visibility map bit is set;
+ * VACUUM will normally skip pages for which either visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
+ * The visibility map is not used for anti-wraparound vacuums before 9.5, because
* an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
* present in the table, even on pages that don't have any dead tuples.
+ * In 9.6 or later, the visibility map has an additional bit that indicates all
+ * tuples on a single page have been completely frozen, so the visibility map is
+ * also used for anti-wraparound vacuums.
+ *
*
* LOCKING
*
@@ -58,14 +65,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +108,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, another one for all-frozen.
+*/
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +125,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set bits for visible and freeze */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_freeze[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,23 +169,23 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
* any I/O.
*/
void
-visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
+visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = flags << (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -225,7 +253,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bits on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +262,10 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer. When checksums are enabled and we're not in recovery,
+ * we must add the heap buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +273,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,7 +283,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -272,11 +301,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (!(map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +314,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -310,9 +339,9 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test if bits are set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
@@ -328,7 +357,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +366,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s %d %u", RelationGetRelationName(rel), heapBlk, bits);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +389,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single or double bit read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ? true : false;
return result;
}
@@ -374,10 +403,12 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * If for_visible is true, we count the number of all-visible bits; if false,
+ * we count the number of all-frozen bits.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, bool for_visible)
{
BlockNumber result = 0;
BlockNumber mapBlock;
@@ -406,7 +437,8 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ result += for_visible ?
+ number_of_ones_for_visible[map[i]] : number_of_ones_for_freeze[map[i]];
}
ReleaseBuffer(mapBuffer);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index ac3b785..c3a6d59 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1867,11 +1867,18 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ {
+ relallvisible = visibilitymap_count(rel, true);
+ relallfrozen = visibilitymap_count(rel, false);
+ }
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
if (rd_rel->relpages != (int32) relpages)
{
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 15ec0ad..a653873 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -586,7 +586,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ visibilitymap_count(onerel, true),
+ visibilitymap_count(onerel, false),
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -609,6 +610,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 3febdd5..d510826 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -22,6 +22,7 @@
#include "access/rewriteheap.h"
#include "access/transam.h"
#include "access/tuptoaster.h"
+#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7ead161..4f5297e 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -736,6 +736,7 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
bool in_outer_xact)
@@ -773,6 +774,11 @@ vac_update_relstats(Relation relation,
pgcform->relallvisible = (int32) num_all_visible_pages;
dirty = true;
}
+ if (pgcform->relallfrozen != (int32) num_all_frozen_pages)
+ {
+ pgcform->relallfrozen = (int32) num_all_frozen_pages;
+ dirty = true;
+ }
/* Apply DDL updates, but not inside an outer transaction (see above) */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index c3d6e59..e9d11fd 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_pages; /* # of pages we skipped by all-frozen bit
+ of visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we may still be able to skip some pages
+ * according to the all-frozen bits in the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,14 +258,16 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_pages)
+ < vacrelstats->rel_pages)
{
- Assert(!scan_all);
scanned_all = false;
}
else
scanned_all = true;
+ scanned_all |= scan_all;
+
/*
* Optionally truncate the relation.
*
@@ -301,10 +308,16 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ /* true means that count for all-visible */
+ new_rel_allvisible = visibilitymap_count(onerel, true);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ /* false means that count for all-frozen */
+ new_rel_allfrozen = visibilitymap_count(onerel, false);
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -312,6 +325,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_pages,
new_rel_tuples,
new_rel_allvisible,
+ new_rel_allfrozen,
vacrelstats->hasindex,
new_frozen_xid,
new_min_multi,
@@ -514,7 +528,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -533,6 +548,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
hastup;
int prev_dead_count;
int nfrozen;
+ int already_nfrozen; /* # of tuples already frozen */
+ int ntup_blk; /* # of tuples in single page */
Size freespace;
bool all_visible_according_to_vm;
bool all_visible;
@@ -547,7 +564,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -565,9 +583,20 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
}
else
{
- /* Current block is all-visible */
+ /*
+ * Current block is all-visible.
+ * If the visibility map says the page is all frozen, we can
+ * skip vacuuming it unconditionally.
+ */
+ if (visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ vacrelstats->vmskipped_pages++;
+ continue;
+ }
+
if (skipping_all_visible_blocks && !scan_all)
continue;
+
all_visible_according_to_vm = true;
}
@@ -739,7 +768,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -763,6 +793,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ already_nfrozen = 0;
+ ntup_blk = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -917,8 +949,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_blk += 1;
hastup = true;
+ /* If current tuple is already frozen, count it up */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ already_nfrozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -930,11 +967,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
+ * We must log the changes to be crash-safe against future truncation
+ * of CLOG.
*/
- if (nfrozen > 0)
+ if (nfrozen > 0 || already_nfrozen > 0)
{
START_CRIT_SECTION();
@@ -952,8 +990,20 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
heap_execute_freeze_tuple(htup, &frozen[i]);
}
+ /*
+ * As a result of scanning a page, we ensure that all tuples
+ * are completely frozen. Set VISIBILITYMAP_ALL_FROZEN bit on
+ * visibility map and PD_ALL_FROZEN flag on page.
+ */
+ if (ntup_blk == (nfrozen + already_nfrozen))
+ {
+ PageSetAllFrozen(page);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer,
+ InvalidTransactionId, VISIBILITYMAP_ALL_FROZEN);
+ }
+
+ /* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (nfrozen > 0 && RelationNeedsWAL(onerel))
{
XLogRecPtr recptr;
@@ -1006,7 +1056,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
MarkBufferDirty(buf);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ vmbuffer, visibility_cutoff_xid, VISIBILITYMAP_ALL_VISIBLE);
}
/*
@@ -1017,11 +1067,11 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
- visibilitymap_clear(onerel, blkno, vmbuffer);
+ visibilitymap_clear(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
}
/*
@@ -1043,7 +1093,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
- visibilitymap_clear(onerel, blkno, vmbuffer);
+ visibilitymap_clear(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
}
UnlockReleaseBuffer(buf);
@@ -1077,7 +1127,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map and frozen map page.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1284,11 +1334,11 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* flag is now set, also set the VM bit.
*/
if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ !visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
Assert(BufferIsValid(*vmbuffer));
visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+ visibility_cutoff_xid, VISIBILITYMAP_ALL_VISIBLE);
}
return tupindex;
@@ -1407,6 +1457,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_pages,
stats->num_index_tuples,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 06b7c3c..d976bf5 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 31666ed..ebd6576 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -82,7 +82,7 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList)
if (attno >= resultDesc->natts)
ereport(ERROR,
(errcode(ERRCODE_DATATYPE_MISMATCH),
- errmsg("table row type and query-specified row type do not match"),
+ errmsg("table row type and query-specified row type do not match"),
errdetail("Query has too many columns.")));
attr = resultDesc->attrs[attno++];
@@ -92,7 +92,7 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList)
if (exprType((Node *) tle->expr) != attr->atttypid)
ereport(ERROR,
(errcode(ERRCODE_DATATYPE_MISMATCH),
- errmsg("table row type and query-specified row type do not match"),
+ errmsg("table row type and query-specified row type do not match"),
errdetail("Table has type %s at ordinal position %d, but query expects %s.",
format_type_be(attr->atttypid),
attno,
@@ -117,7 +117,7 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList)
if (attno != resultDesc->natts)
ereport(ERROR,
(errcode(ERRCODE_DATATYPE_MISMATCH),
- errmsg("table row type and query-specified row type do not match"),
+ errmsg("table row type and query-specified row type do not match"),
errdetail("Query has too few columns.")));
}
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index f0f89de..c60cd2d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -73,6 +73,10 @@
#define XLOG_HEAP_SUFFIX_FROM_OLD (1<<6)
/* last xl_heap_multi_insert record for one heap_multi_insert() call */
#define XLOG_HEAP_LAST_MULTI_INSERT (1<<7)
+/* PD_ALL_FROZEN was cleared */
+#define XLOG_HEAP_ALL_FROZEN_CLEARED (1<<8)
+/* PD_ALL_FROZEN was cleared in the new page of an UPDATE */
+#define XLOG_HEAP_NEW_ALL_FROZEN_CLEARED (1<<9)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLOG_HEAP_CONTAINS_OLD \
@@ -84,10 +88,10 @@ typedef struct xl_heap_delete
TransactionId xmax; /* xmax of the deleted tuple */
OffsetNumber offnum; /* deleted tuple's offset */
uint8 infobits_set; /* infomask bits */
- uint8 flags;
+ uint16 flags;
} xl_heap_delete;
-#define SizeOfHeapDelete (offsetof(xl_heap_delete, flags) + sizeof(uint8))
+#define SizeOfHeapDelete (offsetof(xl_heap_delete, flags) + sizeof(uint16))
/*
* We don't store the whole fixed part (HeapTupleHeaderData) of an inserted
@@ -110,12 +114,12 @@ typedef struct xl_heap_header
typedef struct xl_heap_insert
{
OffsetNumber offnum; /* inserted tuple's offset */
- uint8 flags;
+ uint16 flags;
/* xl_heap_header & TUPLE DATA in backup block 0 */
} xl_heap_insert;
-#define SizeOfHeapInsert (offsetof(xl_heap_insert, flags) + sizeof(uint8))
+#define SizeOfHeapInsert (offsetof(xl_heap_insert, flags) + sizeof(uint16))
/*
* This is what we need to know about a multi-insert.
@@ -130,7 +134,7 @@ typedef struct xl_heap_insert
*/
typedef struct xl_heap_multi_insert
{
- uint8 flags;
+ uint16 flags;
uint16 ntuples;
OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
} xl_heap_multi_insert;
@@ -170,7 +174,7 @@ typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
OffsetNumber old_offnum; /* old tuple's offset */
uint8 old_infobits_set; /* infomask bits to set on old tuple */
- uint8 flags;
+ uint16 flags;
TransactionId new_xmax; /* xmax of the new tuple */
OffsetNumber new_offnum; /* new tuple's offset */
@@ -292,9 +296,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -354,6 +359,8 @@ extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
int ntuples);
+extern XLogRecPtr log_heap_frozenmap(RelFileNode rnode, Buffer heap_buffer,
+ Buffer fm_buffer);
extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
@@ -361,6 +368,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..53d8103 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,21 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
+ Buffer vmbuf, uint8 flags);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern BlockNumber visibilitymap_count(Relation rel, bool for_visible);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 48a7262..b09ae6a 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -95,7 +97,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 30
+#define Natts_pg_class 31
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -107,25 +109,26 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relispopulated 25
-#define Anum_pg_class_relreplident 26
-#define Anum_pg_class_relfrozenxid 27
-#define Anum_pg_class_relminmxid 28
-#define Anum_pg_class_relacl 29
-#define Anum_pg_class_reloptions 30
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relispopulated 26
+#define Anum_pg_class_relreplident 27
+#define Anum_pg_class_relfrozenxid 28
+#define Anum_pg_class_relminmxid 29
+#define Anum_pg_class_relacl 30
+#define Anum_pg_class_reloptions 31
/* ----------------
* initial contents of pg_class
@@ -140,13 +143,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 71f0165..609614c 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -172,6 +172,7 @@ extern void vac_update_relstats(Relation relation,
BlockNumber num_pages,
double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index c2fbffc..f46375d 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -367,6 +369,13 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+#define PageClearAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags &= ~PD_ALL_FROZEN)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
On Thu, Apr 30, 2015 at 8:07 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Fri, Apr 24, 2015 at 11:21 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Fri, Apr 24, 2015 at 1:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 4/23/15 11:06 AM, Petr Jelinek wrote:
On 23/04/15 17:45, Bruce Momjian wrote:
On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote:
Agreed, no extra file, and the same write volume as currently. It would
also match pg_clog, which uses two bits per transaction --- maybe we can
reuse some of that code.

Yeah, this approach seems promising. We probably can't reuse code from
clog because the usage pattern is different (the key for clog is an xid,
while the visibility/freeze map is keyed by ctid). But the visibility map
storage layer is pretty simple, so it should be easy to extend it for this use.

Actually, there may be some bit manipulation functions we could reuse;
things like efficiently counting how many things in a byte are set. Probably
doesn't make sense to fully refactor it, but at least CLOG is a good source
for cut/paste/whack.

I agree with adding a bit to the visibility map that indicates the
corresponding page is all-frozen, just like CLOG. I'll change the patch
accordingly as a second version.

The second patch is attached.
In the second patch, I added a bit to the visibility map that indicates all
tuples in a page are completely frozen.
The visibility map has become a bitmap with two bits per heap page:
all-visible and all-frozen (a small standalone sketch of this layout is
included below).
The logic around vacuum and insert/update/delete on the heap is almost the
same as in the previous version.
This patch still lacks some pieces (documentation, comments in the source
code, etc.), so it is still a WIP patch, but I think it's enough to start
the discussion.

The previous patch no longer applies cleanly to HEAD.
The attached v2 patch is the latest version.
Please review it.
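For reference, the two-bit layout described above can be illustrated with a
short standalone C sketch. This is an illustration only, not part of the
patch; the flag values, BITS_PER_HEAPBLOCK and HEAPBLOCKS_PER_BYTE mirror the
patch's visibilitymap.c, while the helper names are made up for the example.

/*
 * Standalone sketch of the proposed layout: each heap block owns two
 * adjacent bits in the map, one for all-visible and one for all-frozen.
 */
#include <stdio.h>
#include <stdint.h>

#define ALL_VISIBLE 0x01            /* like VISIBILITYMAP_ALL_VISIBLE */
#define ALL_FROZEN  0x02            /* like VISIBILITYMAP_ALL_FROZEN */

#define BITS_PER_HEAPBLOCK  2
#define HEAPBLOCKS_PER_BYTE 4

#define BLK_TO_MAPBYTE(blk) ((blk) / HEAPBLOCKS_PER_BYTE)
#define BLK_TO_MAPBIT(blk)  ((blk) % HEAPBLOCKS_PER_BYTE)

/* set the given flag bits for a heap block */
static void
map_set(uint8_t *map, unsigned blk, uint8_t flags)
{
    map[BLK_TO_MAPBYTE(blk)] |=
        flags << (BITS_PER_HEAPBLOCK * BLK_TO_MAPBIT(blk));
}

/* test whether any of the given flag bits are set for a heap block */
static int
map_test(const uint8_t *map, unsigned blk, uint8_t flags)
{
    return (map[BLK_TO_MAPBYTE(blk)] &
            (flags << (BITS_PER_HEAPBLOCK * BLK_TO_MAPBIT(blk)))) != 0;
}

int
main(void)
{
    uint8_t map[2] = {0, 0};    /* covers 8 heap blocks */

    map_set(map, 5, ALL_VISIBLE);                  /* block 5: visible only */
    map_set(map, 6, ALL_VISIBLE | ALL_FROZEN);     /* block 6: visible and frozen */

    printf("block 5 all-frozen? %d\n", map_test(map, 5, ALL_FROZEN));  /* 0 */
    printf("block 6 all-frozen? %d\n", map_test(map, 6, ALL_FROZEN));  /* 1 */
    return 0;
}

Doubling BITS_PER_HEAPBLOCK means a map byte now covers four heap blocks
instead of eight, so the on-disk visibility map roughly doubles in size for a
given relation.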
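On the bit-counting point raised above: with two interleaved bits per block,
the patch replaces the single number_of_ones[] lookup table with two per-flag
tables (number_of_ones_for_visible and number_of_ones_for_freeze). As a
sketch of how such tables can be generated rather than hand-written (again an
illustration only, assuming the same layout as the previous snippet):

/*
 * Build per-flag popcount tables for the two-bit layout: for each possible
 * map byte, count how many of its four 2-bit slots have the all-visible
 * bit or the all-frozen bit set.
 */
#include <stdio.h>
#include <stdint.h>

#define BITS_PER_HEAPBLOCK  2
#define HEAPBLOCKS_PER_BYTE 4

static uint8_t ones_for_visible[256];
static uint8_t ones_for_frozen[256];

static void
build_tables(void)
{
    for (int b = 0; b < 256; b++)
    {
        for (int slot = 0; slot < HEAPBLOCKS_PER_BYTE; slot++)
        {
            int flags = (b >> (BITS_PER_HEAPBLOCK * slot)) & 0x03;

            if (flags & 0x01)       /* all-visible bit */
                ones_for_visible[b]++;
            if (flags & 0x02)       /* all-frozen bit */
                ones_for_frozen[b]++;
        }
    }
}

int
main(void)
{
    build_tables();

    /* 0x0F marks blocks 0 and 1 as both all-visible and all-frozen */
    printf("0x0F: visible=%d frozen=%d\n",
           ones_for_visible[0x0F], ones_for_frozen[0x0F]);  /* 2 and 2 */
    return 0;
}

The values produced this way should match the hand-written tables in the v2
patch (for example, index 0x0F yields 2 in both tables), which is an easy way
to cross-check them during review.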
Regards,
-------
Sawada Masahiko
Attachments:
000_add_frozen_bit_into_visibilitymap_v2.patch (text/x-diff; charset=US-ASCII)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index caacc10..fcbf06a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -88,7 +88,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool all_frozen_cleared, bool new_all_frozen_cleared);
static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
Bitmapset *hot_attrs,
Bitmapset *key_attrs, Bitmapset *id_attrs,
@@ -2107,7 +2108,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
HeapTuple heaptup;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
- bool all_visible_cleared = false;
+ bool all_visible_cleared = false,
+ all_frozen_cleared = false;
/*
* Fill in tuple header fields, assign an OID, and toast the tuple if
@@ -2131,8 +2133,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map
+ * page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -2150,7 +2153,16 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
PageClearAllVisible(BufferGetPage(buffer));
visibilitymap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
+ }
+
+ if (PageIsAllFrozen(BufferGetPage(buffer)))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(BufferGetPage(buffer));
+ visibilitymap_clear(relation,
+ ItemPointerGetBlockNumber(&(heaptup->t_self)),
+ vmbuffer, VISIBILITYMAP_ALL_FROZEN);
}
/*
@@ -2199,6 +2211,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
xlrec.flags = 0;
if (all_visible_cleared)
xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
+ if (all_frozen_cleared)
+ xlrec.flags |= XLH_INSERT_ALL_FROZEN_CLEARED;
if (options & HEAP_INSERT_SPECULATIVE)
xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
@@ -2406,7 +2420,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
{
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
- bool all_visible_cleared = false;
+ bool all_visible_cleared = false,
+ all_frozen_cleared = false;
int nthispage;
CHECK_FOR_INTERRUPTS();
@@ -2451,7 +2466,16 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
PageClearAllVisible(page);
visibilitymap_clear(relation,
BufferGetBlockNumber(buffer),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
+ }
+
+ if (PageIsAllFrozen(page))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(page);
+ visibilitymap_clear(relation,
+ BufferGetBlockNumber(buffer),
+ vmbuffer, VISIBILITYMAP_ALL_FROZEN);
}
/*
@@ -2496,6 +2520,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
tupledata = scratchptr;
xlrec->flags = all_visible_cleared ? XLH_INSERT_ALL_VISIBLE_CLEARED : 0;
+ if (all_frozen_cleared)
+ xlrec->flags |= XLH_INSERT_ALL_FROZEN_CLEARED;
xlrec->ntuples = nthispage;
/*
@@ -2698,7 +2724,8 @@ heap_delete(Relation relation, ItemPointer tid,
new_infomask2;
bool have_tuple_lock = false;
bool iscombo;
- bool all_visible_cleared = false;
+ bool all_visible_cleared = false,
+ all_frozen_cleared = false;
HeapTuple old_key_tuple = NULL; /* replica identity of the tuple */
bool old_key_copied = false;
@@ -2724,18 +2751,19 @@ heap_delete(Relation relation, ItemPointer tid,
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
- if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+ if (vmbuffer == InvalidBuffer &&
+ (PageIsAllVisible(page) || PageIsAllFrozen(page)))
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
visibilitymap_pin(relation, block, &vmbuffer);
@@ -2925,12 +2953,22 @@ l1:
*/
PageSetPrunable(page, xid);
+ /* clear PD_ALL_VISIBLE flags */
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
PageClearAllVisible(page);
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
+ }
+
+ /* clear PD_ALL_FROZEN flags */
+ if (PageIsAllFrozen(page))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(page);
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+ vmbuffer, VISIBILITYMAP_ALL_FROZEN);
}
/* store transaction information of xact deleting the tuple */
@@ -2962,6 +3000,8 @@ l1:
log_heap_new_cid(relation, &tp);
xlrec.flags = all_visible_cleared ? XLH_DELETE_ALL_VISIBLE_CLEARED : 0;
+ if (all_frozen_cleared)
+ xlrec.flags |= XLH_DELETE_ALL_FROZEN_CLEARED;
xlrec.infobits_set = compute_infobits(tp.t_data->t_infomask,
tp.t_data->t_infomask2);
xlrec.offnum = ItemPointerGetOffsetNumber(&tp.t_self);
@@ -3159,6 +3199,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool key_intact;
bool all_visible_cleared = false;
bool all_visible_cleared_new = false;
+ bool all_frozen_cleared = false;
+ bool all_frozen_cleared_new = false;
bool checked_lockers;
bool locker_remains;
TransactionId xmax_new_tuple,
@@ -3202,12 +3244,12 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
page = BufferGetPage(buffer);
/*
- * Before locking the buffer, pin the visibility map page if it appears to
- * be necessary. Since we haven't got the lock yet, someone else might be
+ * Before locking the buffer, pin the visibility map if it appears to be
+ * necessary. Since we haven't got the lock yet, someone else might be
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -3490,21 +3532,23 @@ l2:
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
+
bms_free(hot_attrs);
bms_free(key_attrs);
return result;
}
/*
- * If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, or during some
- * subsequent window during which we had it unlocked, we'll have to unlock
- * and re-lock, to avoid holding the buffer lock across an I/O. That's a
- * bit unfortunate, especially since we'll now have to recheck whether the
- * tuple has been locked or updated under us, but hopefully it won't
- * happen very often.
+ * If we didn't pin the visibility map page and the page has
+ * become all visible (or all frozen) while we were busy locking the buffer,
+ * or during some subsequent window during which we had it unlocked,
+ * we'll have to unlock and re-lock, to avoid holding the buffer lock
+ * across an I/O. That's a bit unfortunate, especially since we'll now
+ * have to recheck whether the tuple has been locked or updated under us,
+ * but hopefully it won't happen very often.
*/
- if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+ if (vmbuffer == InvalidBuffer &&
+ (PageIsAllVisible(page) || PageIsAllFrozen(page)))
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
visibilitymap_pin(relation, block, &vmbuffer);
@@ -3803,14 +3847,30 @@ l2:
all_visible_cleared = true;
PageClearAllVisible(BufferGetPage(buffer));
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
}
if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
{
all_visible_cleared_new = true;
PageClearAllVisible(BufferGetPage(newbuf));
visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
- vmbuffer_new);
+ vmbuffer_new, VISIBILITYMAP_ALL_VISIBLE);
+ }
+
+ /* clear PD_ALL_FROZEN flags */
+ if (PageIsAllFrozen(BufferGetPage(buffer)))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(BufferGetPage(buffer));
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+ vmbuffer, VISIBILITYMAP_ALL_FROZEN);
+ }
+ if (newbuf != buffer && PageIsAllFrozen(BufferGetPage(newbuf)))
+ {
+ all_frozen_cleared_new = true;
+ PageClearAllFrozen(BufferGetPage(newbuf));
+ visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
+ vmbuffer_new, VISIBILITYMAP_ALL_FROZEN);
}
if (newbuf != buffer)
@@ -3836,7 +3896,9 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ all_frozen_cleared,
+ all_frozen_cleared_new);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -6893,7 +6955,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -6903,6 +6965,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -6926,7 +6989,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool all_frozen_cleared, bool new_all_frozen_cleared)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -7009,6 +7073,10 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.flags |= XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED;
if (new_all_visible_cleared)
xlrec.flags |= XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (all_frozen_cleared)
+ xlrec.flags |= XLH_UPDATE_OLD_ALL_FROZEN_CLEARED;
+ if (new_all_frozen_cleared)
+ xlrec.flags |= XLH_UPDATE_NEW_ALL_FROZEN_CLEARED;
if (prefixlen > 0)
xlrec.flags |= XLH_UPDATE_PREFIX_FROM_OLD;
if (suffixlen > 0)
@@ -7491,8 +7559,14 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
+
}
else if (action == BLK_RESTORED)
{
@@ -7543,7 +7617,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
@@ -7652,13 +7726,20 @@ heap_xlog_delete(XLogReaderState *record)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLH_DELETE_ALL_VISIBLE_CLEARED | XLH_DELETE_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(target_node);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set whichever flags were cleared (one or both) */
+ flags |= xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLH_DELETE_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -7730,13 +7811,20 @@ heap_xlog_insert(XLogReaderState *record)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLH_INSERT_ALL_VISIBLE_CLEARED | XLH_INSERT_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(target_node);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set whichever flags were cleared (one or both) */
+ flags |= xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLH_INSERT_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -7796,6 +7884,9 @@ heap_xlog_insert(XLogReaderState *record)
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLH_INSERT_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
+
MarkBufferDirty(buffer);
}
if (BufferIsValid(buffer))
@@ -7850,13 +7941,20 @@ heap_xlog_multi_insert(XLogReaderState *record)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLH_INSERT_ALL_VISIBLE_CLEARED | XLH_INSERT_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(rnode);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set whichever flags were cleared (one or both) */
+ flags |= xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLH_INSERT_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -7934,6 +8032,8 @@ heap_xlog_multi_insert(XLogReaderState *record)
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLH_INSERT_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
MarkBufferDirty(buffer);
}
@@ -8005,13 +8105,20 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED | XLH_UPDATE_OLD_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(rnode);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set whichever flags were cleared (one or both) */
+ flags |= xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLH_UPDATE_OLD_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, oldblk, &vmbuffer);
- visibilitymap_clear(reln, oldblk, vmbuffer);
+ visibilitymap_clear(reln, oldblk, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8062,6 +8169,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLH_UPDATE_OLD_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
PageSetLSN(page, lsn);
MarkBufferDirty(obuffer);
@@ -8089,13 +8198,20 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED | XLH_UPDATE_NEW_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(rnode);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set whichever flags were cleared (one or both) */
+ flags |= xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLH_UPDATE_NEW_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, newblk, &vmbuffer);
- visibilitymap_clear(reln, newblk, vmbuffer);
+ visibilitymap_clear(reln, newblk, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8197,6 +8313,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLH_UPDATE_NEW_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6db73bf..4e19f9c 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -327,7 +327,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
{
/* easy case */
buffer = ReadBufferBI(relation, targetBlock, bistate);
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(BufferGetPage(buffer)) ||
+ PageIsAllFrozen(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
@@ -335,7 +336,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
{
/* also easy case */
buffer = otherBuffer;
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(BufferGetPage(buffer)) ||
+ PageIsAllFrozen(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
@@ -343,7 +345,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
{
/* lock other buffer first */
buffer = ReadBuffer(relation, targetBlock);
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(BufferGetPage(buffer)) ||
+ PageIsAllFrozen(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -352,7 +355,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
{
/* lock target buffer first */
buffer = ReadBuffer(relation, targetBlock);
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(BufferGetPage(buffer)) ||
+ PageIsAllFrozen(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..ab8beef 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,11 +21,14 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, so the page doesn't need to be vacuumed even when a
+ * whole-table-scanning vacuum is required. The map is conservative in the sense that
+ * we make sure that whenever a bit is set, we know the condition is true,
+ * but if a bit is not set, it might or might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
@@ -33,21 +36,25 @@
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or frozen
+ * to all transactions; we just don't know that for certain. The difficulty is
+ * that there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit. If a
+ * crash occurs after the visibility map page makes it to disk and before the
+ * updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
- * VACUUM will normally skip pages for which the visibility map bit is set;
+ * VACUUM will normally skip pages for which either visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
+ * The visibility map is not used for anti-wraparound vacuums before 9.6, because
* an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
* present in the table, even on pages that don't have any dead tuples.
+ * In 9.6 or later, the visibility map has an additional bit that indicates that
+ * all tuples on a single page have been completely frozen, so the visibility map
+ * is also used for anti-wraparound vacuums.
+ *
*
* LOCKING
*
@@ -58,14 +65,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +108,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One bit for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +125,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set bits for visible and freeze */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_freeze[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,23 +169,23 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
* any I/O.
*/
void
-visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
+visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = flags << (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -225,7 +253,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bits on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +262,10 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer. When checksums are enabled and we're not in recovery,
+ * we must add the heap buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +273,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,7 +283,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -272,11 +301,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (!(map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +314,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -310,9 +339,9 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test if bits are set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
@@ -328,7 +357,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +366,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s %d %u", RelationGetRelationName(rel), heapBlk, bits);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +389,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single or double bit read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ? true : false;
return result;
}
@@ -374,10 +403,12 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * If for_visible is true, we count the number of all-visible bits; if false,
+ * we count the number of all-frozen bits.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, bool for_visible)
{
BlockNumber result = 0;
BlockNumber mapBlock;
@@ -406,7 +437,8 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ result += for_visible ?
+ number_of_ones_for_visible[map[i]] : number_of_ones_for_freeze[map[i]];
}
ReleaseBuffer(mapBuffer);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 4246554..65753d9 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ {
+ relallvisible = visibilitymap_count(rel, true);
+ relallfrozen = visibilitymap_count(rel, false);
+ }
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
if (rd_rel->relpages != (int32) relpages)
{
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 861048f..1eaf2da 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -572,7 +572,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ visibilitymap_count(onerel, true),
+ visibilitymap_count(onerel, false),
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -595,6 +596,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ab4874..d3725dd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -22,6 +22,7 @@
#include "access/rewriteheap.h"
#include "access/transam.h"
#include "access/tuptoaster.h"
+#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index baf66f1..d68c7c4 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -744,6 +744,7 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
bool in_outer_xact)
@@ -781,6 +782,11 @@ vac_update_relstats(Relation relation,
pgcform->relallvisible = (int32) num_all_visible_pages;
dirty = true;
}
+ if (pgcform->relallfrozen != (int32) num_all_frozen_pages)
+ {
+ pgcform->relallfrozen = (int32) num_all_frozen_pages;
+ dirty = true;
+ }
/* Apply DDL updates, but not inside an outer transaction (see above) */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..fc149af 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_pages; /* # of pages we skipped due to the all-frozen
+ bit in the visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we can still skip scanning pages whose
+ * all-frozen bit is set in the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,14 +258,16 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_pages)
+ < vacrelstats->rel_pages)
{
- Assert(!scan_all);
scanned_all = false;
}
else
scanned_all = true;
+ scanned_all |= scan_all;
+
/*
* Optionally truncate the relation.
*
@@ -301,10 +308,16 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ /* true: count all-visible pages */
+ new_rel_allvisible = visibilitymap_count(onerel, true);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ /* false: count all-frozen pages */
+ new_rel_allfrozen = visibilitymap_count(onerel, false);
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -312,6 +325,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_pages,
new_rel_tuples,
new_rel_allvisible,
+ new_rel_allfrozen,
vacrelstats->hasindex,
new_frozen_xid,
new_min_multi,
@@ -515,7 +529,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -534,6 +549,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
hastup;
int prev_dead_count;
int nfrozen;
+ int already_nfrozen; /* # of tuples already frozen */
+ int ntup_blk; /* # of tuples in single page */
Size freespace;
bool all_visible_according_to_vm;
bool all_visible;
@@ -548,7 +565,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -566,9 +584,20 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
}
else
{
- /* Current block is all-visible */
+ /*
+ * Current block is all-visible.
+ * If the visibility map also shows it as all-frozen, we can
+ * skip vacuuming this page unconditionally.
+ */
+ if (visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ vacrelstats->vmskipped_pages++;
+ continue;
+ }
+
if (skipping_all_visible_blocks && !scan_all)
continue;
+
all_visible_according_to_vm = true;
}
@@ -740,7 +769,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +794,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ already_nfrozen = 0;
+ ntup_blk = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +950,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_blk += 1;
hastup = true;
+ /* If current tuple is already frozen, count it up */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ already_nfrozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,11 +968,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
+ * We must log the changes to be crash-safe against future truncation
+ * of CLOG.
*/
- if (nfrozen > 0)
+ if (nfrozen > 0 || already_nfrozen > 0)
{
START_CRIT_SECTION();
@@ -953,8 +991,20 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
heap_execute_freeze_tuple(htup, &frozen[i]);
}
+ /*
+ * If, as a result of scanning the page, all tuples turn out to be
+ * completely frozen, set the VISIBILITYMAP_ALL_FROZEN bit in the
+ * visibility map and the PD_ALL_FROZEN flag on the page.
+ */
+ if (ntup_blk == (nfrozen + already_nfrozen))
+ {
+ PageSetAllFrozen(page);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer,
+ InvalidTransactionId, VISIBILITYMAP_ALL_FROZEN);
+ }
+
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (nfrozen > 0 && RelationNeedsWAL(onerel))
{
XLogRecPtr recptr;
@@ -1007,7 +1057,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
MarkBufferDirty(buf);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ vmbuffer, visibility_cutoff_xid, VISIBILITYMAP_ALL_VISIBLE);
}
/*
@@ -1018,11 +1068,11 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
- visibilitymap_clear(onerel, blkno, vmbuffer);
+ visibilitymap_clear(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
}
/*
@@ -1044,7 +1094,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
- visibilitymap_clear(onerel, blkno, vmbuffer);
+ visibilitymap_clear(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
}
UnlockReleaseBuffer(buf);
@@ -1078,7 +1128,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map and frozen map page.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1285,11 +1335,11 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* flag is now set, also set the VM bit.
*/
if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ !visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
Assert(BufferIsValid(*vmbuffer));
visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+ visibility_cutoff_xid, VISIBILITYMAP_ALL_VISIBLE);
}
return tupindex;
@@ -1408,6 +1458,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_pages,
stats->num_index_tuples,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 874ca6a..376841a 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -127,7 +127,7 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList)
if (attno != resultDesc->natts)
ereport(ERROR,
(errcode(ERRCODE_DATATYPE_MISMATCH),
- errmsg("table row type and query-specified row type do not match"),
+ errmsg("table row type and query-specified row type do not match"),
errdetail("Query has too few columns.")));
}
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..d2f083b 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -64,9 +64,10 @@
*/
/* PD_ALL_VISIBLE was cleared */
#define XLH_INSERT_ALL_VISIBLE_CLEARED (1<<0)
-#define XLH_INSERT_LAST_IN_MULTI (1<<1)
-#define XLH_INSERT_IS_SPECULATIVE (1<<2)
-#define XLH_INSERT_CONTAINS_NEW_TUPLE (1<<3)
+#define XLH_INSERT_ALL_FROZEN_CLEARED (1<<1)
+#define XLH_INSERT_LAST_IN_MULTI (1<<2)
+#define XLH_INSERT_IS_SPECULATIVE (1<<3)
+#define XLH_INSERT_CONTAINS_NEW_TUPLE (1<<4)
/*
* xl_heap_update flag values, 8 bits are available.
@@ -75,11 +76,15 @@
#define XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED (1<<0)
/* PD_ALL_VISIBLE was cleared in the 2nd page */
#define XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED (1<<1)
-#define XLH_UPDATE_CONTAINS_OLD_TUPLE (1<<2)
-#define XLH_UPDATE_CONTAINS_OLD_KEY (1<<3)
-#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<4)
-#define XLH_UPDATE_PREFIX_FROM_OLD (1<<5)
-#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<6)
+/* PD_ALL_FROZEN was cleared */
+#define XLH_UPDATE_OLD_ALL_FROZEN_CLEARED (1<<2)
+/* PD_ALL_FROZEN was cleared in the 2nd page */
+#define XLH_UPDATE_NEW_ALL_FROZEN_CLEARED (1<<3)
+#define XLH_UPDATE_CONTAINS_OLD_TUPLE (1<<4)
+#define XLH_UPDATE_CONTAINS_OLD_KEY (1<<5)
+#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<6)
+#define XLH_UPDATE_PREFIX_FROM_OLD (1<<7)
+#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<8)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_UPDATE_CONTAINS_OLD \
@@ -90,9 +95,10 @@
*/
/* PD_ALL_VISIBLE was cleared */
#define XLH_DELETE_ALL_VISIBLE_CLEARED (1<<0)
-#define XLH_DELETE_CONTAINS_OLD_TUPLE (1<<1)
-#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
-#define XLH_DELETE_IS_SUPER (1<<3)
+#define XLH_DELETE_ALL_FROZEN_CLEARED (1<<1)
+#define XLH_DELETE_CONTAINS_OLD_TUPLE (1<<2)
+#define XLH_DELETE_CONTAINS_OLD_KEY (1<<3)
+#define XLH_DELETE_IS_SUPER (1<<4)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
@@ -320,9 +326,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -382,6 +389,8 @@ extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
int ntuples);
+extern XLogRecPtr log_heap_frozenmap(RelFileNode rnode, Buffer heap_buffer,
+ Buffer fm_buffer);
extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
@@ -389,6 +398,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..53d8103 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,21 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
+ Buffer vmbuf, uint8 flags);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern BlockNumber visibilitymap_count(Relation rel, bool for_visible);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index fea99c7..1a8c18c 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -95,7 +97,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 30
+#define Natts_pg_class 31
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -107,25 +109,26 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relispopulated 25
-#define Anum_pg_class_relreplident 26
-#define Anum_pg_class_relfrozenxid 27
-#define Anum_pg_class_relminmxid 28
-#define Anum_pg_class_relacl 29
-#define Anum_pg_class_reloptions 30
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relispopulated 26
+#define Anum_pg_class_relreplident 27
+#define Anum_pg_class_relfrozenxid 28
+#define Anum_pg_class_relminmxid 29
+#define Anum_pg_class_relacl 30
+#define Anum_pg_class_reloptions 31
/* ----------------
* initial contents of pg_class
@@ -140,13 +143,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e3a31af..d2bae2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -172,6 +172,7 @@ extern void vac_update_relstats(Relation relation,
BlockNumber num_pages,
double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index c2fbffc..f46375d 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -367,6 +369,13 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+#define PageClearAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags &= ~PD_ALL_FROZEN)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
On Thu, May 28, 2015 at 11:34 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Thu, Apr 30, 2015 at 8:07 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Fri, Apr 24, 2015 at 11:21 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Fri, Apr 24, 2015 at 1:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 4/23/15 11:06 AM, Petr Jelinek wrote:
On 23/04/15 17:45, Bruce Momjian wrote:
On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote:
Agreed, no extra file, and the same write volume as currently. It would
also match pg_clog, which uses two bits per transaction --- maybe we can
reuse some of that code.

Yeah, this approach seems promising. We probably can't reuse code from
clog because the usage pattern is different (the key for clog is an xid,
while for the visibility/freeze map a ctid is used). But the visibility
map storage layer is pretty simple, so it should be easy to extend it
for this use.

Actually, there may be some bit manipulation functions we could reuse;
things like efficiently counting how many bits in a byte are set. It
probably doesn't make sense to fully refactor it, but at least CLOG is a
good source for cut/paste/whack.

I agree with adding a bit to the visibility map that indicates the
corresponding page is all-frozen, just like CLOG. I'll change the patch
accordingly as a second version.
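A minimal, standalone C sketch of the per-flag bit counting discussed
above, assuming the two-bit-per-heap-block layout the attached patch
adopts. The names ALL_VISIBLE, ALL_FROZEN and count_flag_in_byte are
placeholders of mine; they mirror VISIBILITYMAP_ALL_VISIBLE,
VISIBILITYMAP_ALL_FROZEN and the number_of_ones_for_* lookup tables in
the patch, which precompute the same counts for all 256 byte values
instead of looping.

#include <stdint.h>
#include <stdio.h>

#define ALL_VISIBLE 0x01    /* mirrors VISIBILITYMAP_ALL_VISIBLE */
#define ALL_FROZEN  0x02    /* mirrors VISIBILITYMAP_ALL_FROZEN  */

/* With two bits per heap block, one map byte covers four heap blocks. */
static unsigned
count_flag_in_byte(uint8_t mapbyte, uint8_t flag)
{
    unsigned    n = 0;
    int         slot;

    for (slot = 0; slot < 4; slot++)
        if (mapbyte & (flag << (2 * slot)))
            n++;
    return n;
}

int
main(void)
{
    /* blocks 0 and 2 are all-visible; block 2 is also all-frozen */
    uint8_t     mapbyte = ALL_VISIBLE | ((ALL_VISIBLE | ALL_FROZEN) << 4);

    printf("all-visible blocks: %u, all-frozen blocks: %u\n",
           count_flag_in_byte(mapbyte, ALL_VISIBLE),
           count_flag_in_byte(mapbyte, ALL_FROZEN));
    return 0;
}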
The second patch is attached. In this second patch, I added a bit to the
visibility map that indicates all tuples on a page are completely frozen.
The visibility map thus became a bitmap with two bits per heap page:
all-visible and all-frozen. The logic around vacuum and heap
insert/update/delete is almost the same as in the previous version.

This patch still lacks some points (documentation, comments in the
source code, etc.), so it is still a WIP patch, but I think it is enough
to start the discussion.

The previous patch no longer applies cleanly to HEAD. The attached v2
patch is the latest version. Please review it.
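To make the two-bit layout concrete, here is a small illustrative C
sketch (mine, not taken from the patch) of how one heap block's
all-visible and all-frozen flags are addressed within its map byte.
vm_set_bits and vm_test_bits are hypothetical names; the shift
arithmetic mirrors what visibilitymap_set() and visibilitymap_test() do
in the attached patch once BITS_PER_HEAPBLOCK becomes 2 and
HEAPBLOCKS_PER_BYTE drops from 8 to 4.

#include <stdint.h>
#include <stdio.h>

#define BITS_PER_HEAPBLOCK  2
#define HEAPBLOCKS_PER_BYTE 4
#define ALL_VISIBLE 0x01
#define ALL_FROZEN  0x02

/* Set the given flags for heap block blkno in a toy in-memory map. */
static void
vm_set_bits(uint8_t *map, unsigned blkno, uint8_t flags)
{
    unsigned    mapByte = blkno / HEAPBLOCKS_PER_BYTE;
    unsigned    mapBit = blkno % HEAPBLOCKS_PER_BYTE;

    map[mapByte] |= flags << (BITS_PER_HEAPBLOCK * mapBit);
}

/* Test whether any of the given flags are set for heap block blkno. */
static int
vm_test_bits(const uint8_t *map, unsigned blkno, uint8_t flags)
{
    unsigned    mapByte = blkno / HEAPBLOCKS_PER_BYTE;
    unsigned    mapBit = blkno % HEAPBLOCKS_PER_BYTE;

    return (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) != 0;
}

int
main(void)
{
    uint8_t     map[2] = {0, 0};    /* covers 8 heap blocks */

    vm_set_bits(map, 5, ALL_VISIBLE | ALL_FROZEN);

    printf("block 5 all-frozen: %d, block 4 all-visible: %d\n",
           vm_test_bits(map, 5, ALL_FROZEN),
           vm_test_bits(map, 4, ALL_VISIBLE));   /* prints 1 and 0 */
    return 0;
}

Checking both flags at once, as the patched lazy_scan_heap() does with
VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN, works because the
test only asks whether any of the requested bits are set.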
Attached new rebased version patch.
Please give me comments!
Regards,
--
Sawada Masahiko
Attachments:
000_add_frozen_bit_into_visibilitymap_v3.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 86a2e6b..835d714 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -88,7 +88,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool all_frozen_cleared, bool new_all_frozen_cleared);
static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
Bitmapset *hot_attrs,
Bitmapset *key_attrs, Bitmapset *id_attrs,
@@ -2107,7 +2108,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
HeapTuple heaptup;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
- bool all_visible_cleared = false;
+ bool all_visible_cleared = false,
+ all_frozen_cleared = false;
/*
* Fill in tuple header fields, assign an OID, and toast the tuple if
@@ -2131,8 +2133,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -2150,7 +2153,16 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
PageClearAllVisible(BufferGetPage(buffer));
visibilitymap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
+ }
+
+ if (PageIsAllFrozen(BufferGetPage(buffer)))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(BufferGetPage(buffer));
+ visibilitymap_clear(relation,
+ ItemPointerGetBlockNumber(&(heaptup->t_self)),
+ vmbuffer, VISIBILITYMAP_ALL_FROZEN);
}
/*
@@ -2199,6 +2211,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
xlrec.flags = 0;
if (all_visible_cleared)
xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
+ if (all_frozen_cleared)
+ xlrec.flags |= XLH_INSERT_ALL_FROZEN_CLEARED;
if (options & HEAP_INSERT_SPECULATIVE)
xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
@@ -2406,7 +2420,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
{
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
- bool all_visible_cleared = false;
+ bool all_visible_cleared = false,
+ all_frozen_cleared = false;
int nthispage;
CHECK_FOR_INTERRUPTS();
@@ -2451,7 +2466,16 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
PageClearAllVisible(page);
visibilitymap_clear(relation,
BufferGetBlockNumber(buffer),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
+ }
+
+ if (PageIsAllFrozen(page))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(page);
+ visibilitymap_clear(relation,
+ BufferGetBlockNumber(buffer),
+ vmbuffer, VISIBILITYMAP_ALL_FROZEN);
}
/*
@@ -2496,6 +2520,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
tupledata = scratchptr;
xlrec->flags = all_visible_cleared ? XLH_INSERT_ALL_VISIBLE_CLEARED : 0;
+ if (all_frozen_cleared)
+ xlrec->flags |= XLH_INSERT_ALL_FROZEN_CLEARED;
xlrec->ntuples = nthispage;
/*
@@ -2698,7 +2724,8 @@ heap_delete(Relation relation, ItemPointer tid,
new_infomask2;
bool have_tuple_lock = false;
bool iscombo;
- bool all_visible_cleared = false;
+ bool all_visible_cleared = false,
+ all_frozen_cleared = false;
HeapTuple old_key_tuple = NULL; /* replica identity of the tuple */
bool old_key_copied = false;
@@ -2724,18 +2751,19 @@ heap_delete(Relation relation, ItemPointer tid,
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
- if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+ if (vmbuffer == InvalidBuffer &&
+ (PageIsAllVisible(page) || PageIsAllFrozen(page)))
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
visibilitymap_pin(relation, block, &vmbuffer);
@@ -2925,12 +2953,22 @@ l1:
*/
PageSetPrunable(page, xid);
+ /* clear PD_ALL_VISIBLE flags */
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
PageClearAllVisible(page);
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
+ }
+
+ /* clear PD_ALL_FROZEN flags */
+ if (PageIsAllFrozen(page))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(page);
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+ vmbuffer, VISIBILITYMAP_ALL_FROZEN);
}
/* store transaction information of xact deleting the tuple */
@@ -2962,6 +3000,8 @@ l1:
log_heap_new_cid(relation, &tp);
xlrec.flags = all_visible_cleared ? XLH_DELETE_ALL_VISIBLE_CLEARED : 0;
+ if (all_frozen_cleared)
+ xlrec.flags |= XLH_DELETE_ALL_FROZEN_CLEARED;
xlrec.infobits_set = compute_infobits(tp.t_data->t_infomask,
tp.t_data->t_infomask2);
xlrec.offnum = ItemPointerGetOffsetNumber(&tp.t_self);
@@ -3159,6 +3199,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool key_intact;
bool all_visible_cleared = false;
bool all_visible_cleared_new = false;
+ bool all_frozen_cleared = false;
+ bool all_frozen_cleared_new = false;
bool checked_lockers;
bool locker_remains;
TransactionId xmax_new_tuple,
@@ -3202,12 +3244,12 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
page = BufferGetPage(buffer);
/*
- * Before locking the buffer, pin the visibility map page if it appears to
- * be necessary. Since we haven't got the lock yet, someone else might be
+ * Before locking the buffer, pin the visibility map if it appears to be
+ * necessary. Since we haven't got the lock yet, someone else might be
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -3490,21 +3532,23 @@ l2:
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
+
bms_free(hot_attrs);
bms_free(key_attrs);
return result;
}
/*
- * If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, or during some
- * subsequent window during which we had it unlocked, we'll have to unlock
- * and re-lock, to avoid holding the buffer lock across an I/O. That's a
- * bit unfortunate, especially since we'll now have to recheck whether the
- * tuple has been locked or updated under us, but hopefully it won't
- * happen very often.
+ * If we didn't pin the visibility map page and the page has
+ * become all visible (or all frozen) while we were busy locking the buffer,
+ * or during some subsequent window during which we had it unlocked,
+ * we'll have to unlock and re-lock, to avoid holding the buffer lock
+ * across an I/O. That's a bit unfortunate, especially since we'll now
+ * have to recheck whether the tuple has been locked or updated under us,
+ * but hopefully it won't happen very often.
*/
- if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+ if (vmbuffer == InvalidBuffer &&
+ (PageIsAllVisible(page) || PageIsAllFrozen(page)))
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
visibilitymap_pin(relation, block, &vmbuffer);
@@ -3803,14 +3847,30 @@ l2:
all_visible_cleared = true;
PageClearAllVisible(BufferGetPage(buffer));
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
}
if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
{
all_visible_cleared_new = true;
PageClearAllVisible(BufferGetPage(newbuf));
visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
- vmbuffer_new);
+ vmbuffer_new, VISIBILITYMAP_ALL_VISIBLE);
+ }
+
+ /* clear PD_ALL_FROZEN flags */
+ if (PageIsAllFrozen(BufferGetPage(buffer)))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(BufferGetPage(buffer));
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+ vmbuffer, VISIBILITYMAP_ALL_FROZEN);
+ }
+ if (newbuf != buffer && PageIsAllFrozen(BufferGetPage(newbuf)))
+ {
+ all_frozen_cleared_new = true;
+ PageClearAllFrozen(BufferGetPage(newbuf));
+ visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
+ vmbuffer_new, VISIBILITYMAP_ALL_FROZEN);
}
if (newbuf != buffer)
@@ -3836,7 +3896,9 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ all_frozen_cleared,
+ all_frozen_cleared_new);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -6893,7 +6955,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -6903,6 +6965,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -6926,7 +6989,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool all_frozen_cleared, bool new_all_frozen_cleared)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -7009,6 +7073,10 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.flags |= XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED;
if (new_all_visible_cleared)
xlrec.flags |= XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (all_frozen_cleared)
+ xlrec.flags |= XLH_UPDATE_OLD_ALL_FROZEN_CLEARED;
+ if (new_all_frozen_cleared)
+ xlrec.flags |= XLH_UPDATE_NEW_ALL_FROZEN_CLEARED;
if (prefixlen > 0)
xlrec.flags |= XLH_UPDATE_PREFIX_FROM_OLD;
if (suffixlen > 0)
@@ -7492,8 +7560,14 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
+
}
else if (action == BLK_RESTORED)
{
@@ -7544,7 +7618,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
@@ -7656,13 +7730,20 @@ heap_xlog_delete(XLogReaderState *record)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLH_DELETE_ALL_VISIBLE_CLEARED | XLH_DELETE_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(target_node);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set flags for either clear one flags or both */
+ flags |= xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLH_DELETE_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -7734,13 +7815,20 @@ heap_xlog_insert(XLogReaderState *record)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLH_INSERT_ALL_VISIBLE_CLEARED | XLH_INSERT_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(target_node);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set flags for either clear one flags or both */
+ flags |= xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLH_INSERT_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -7800,6 +7888,9 @@ heap_xlog_insert(XLogReaderState *record)
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLH_INSERT_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
+
MarkBufferDirty(buffer);
}
if (BufferIsValid(buffer))
@@ -7854,13 +7945,20 @@ heap_xlog_multi_insert(XLogReaderState *record)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLH_INSERT_ALL_VISIBLE_CLEARED | XLH_INSERT_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(rnode);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set flags for either clear one flags or both */
+ flags |= xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLH_INSERT_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -7938,6 +8036,8 @@ heap_xlog_multi_insert(XLogReaderState *record)
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLH_INSERT_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
MarkBufferDirty(buffer);
}
@@ -8009,13 +8109,20 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED | XLH_UPDATE_OLD_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(rnode);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set flags for either clear one flags or both */
+ flags |= xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLH_UPDATE_OLD_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, oldblk, &vmbuffer);
- visibilitymap_clear(reln, oldblk, vmbuffer);
+ visibilitymap_clear(reln, oldblk, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8066,6 +8173,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLH_UPDATE_OLD_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
PageSetLSN(page, lsn);
MarkBufferDirty(obuffer);
@@ -8093,13 +8202,20 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED | XLH_UPDATE_NEW_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(rnode);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set flags for either clear one flags or both */
+ flags |= xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLH_UPDATE_NEW_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, newblk, &vmbuffer);
- visibilitymap_clear(reln, newblk, vmbuffer);
+ visibilitymap_clear(reln, newblk, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8201,6 +8317,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLH_UPDATE_NEW_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6db73bf..4e19f9c 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -327,7 +327,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
{
/* easy case */
buffer = ReadBufferBI(relation, targetBlock, bistate);
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(BufferGetPage(buffer)) ||
+ PageIsAllFrozen(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
@@ -335,7 +336,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
{
/* also easy case */
buffer = otherBuffer;
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(BufferGetPage(buffer)) ||
+ PageIsAllFrozen(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
@@ -343,7 +345,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
{
/* lock other buffer first */
buffer = ReadBuffer(relation, targetBlock);
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(BufferGetPage(buffer)) ||
+ PageIsAllFrozen(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -352,7 +355,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
{
/* lock target buffer first */
buffer = ReadBuffer(relation, targetBlock);
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(BufferGetPage(buffer)) ||
+ PageIsAllFrozen(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..ab8beef 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,11 +21,14 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, so the page doesn't need to be vacuumed even if a
+ * whole-table scanning vacuum is required. The map is conservative in the sense that
+ * we make sure that whenever a bit is set, we know the condition is true,
+ * but if a bit is not set, it might or might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
@@ -33,21 +36,25 @@
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or frozen
+ * to all transactions; we just don't know that for certain. The difficulty is
+ * that there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit. If a
+ * crash occurs after the visibility map page makes it to disk and before the
+ * updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
- * VACUUM will normally skip pages for which the visibility map bit is set;
+ * VACUUM will normally skip pages for which either visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
+ * The visibility map is not used for anti-wraparound vacuums before 9.5, because
* an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
* present in the table, even on pages that don't have any dead tuples.
+ * 9.6 or later, the visibility map has a additional bit that indicates all tuple
+ * on single page has been completely forzen, so the visibility map is also used for
+ * anti-wraparound vacuums.
+ *
*
* LOCKING
*
@@ -58,14 +65,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +108,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, another one for all-frozen.
+*/
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +125,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set bits for the all-visible and all-frozen flags */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_freeze[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,23 +169,23 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
* any I/O.
*/
void
-visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
+visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = flags << (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -225,7 +253,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bits on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +262,10 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer. When checksums are enabled and we're not in recovery,
+ * we must add the heap buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +273,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,7 +283,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -272,11 +301,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (!(map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +314,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -310,9 +339,9 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test if bits are set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
@@ -328,7 +357,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +366,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +389,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single or double bit read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ? true : false;
return result;
}
@@ -374,10 +403,12 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * If for_visible is true, we count the number of all-visible bits; if false,
+ * we count the number of all-frozen bits.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, bool for_visible)
{
BlockNumber result = 0;
BlockNumber mapBlock;
@@ -406,7 +437,8 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ result += for_visible ?
+ number_of_ones_for_visible[map[i]] : number_of_ones_for_freeze[map[i]];
}
ReleaseBuffer(mapBuffer);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 4246554..65753d9 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ {
+ relallvisible = visibilitymap_count(rel, true);
+ relallfrozen = visibilitymap_count(rel, false);
+ }
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
if (rd_rel->relpages != (int32) relpages)
{
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 861048f..1eaf2da 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -572,7 +572,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ visibilitymap_count(onerel, true),
+ visibilitymap_count(onerel, false),
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -595,6 +596,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ab4874..d3725dd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -22,6 +22,7 @@
#include "access/rewriteheap.h"
#include "access/transam.h"
#include "access/tuptoaster.h"
+#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index baf66f1..d68c7c4 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -744,6 +744,7 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
bool in_outer_xact)
@@ -781,6 +782,11 @@ vac_update_relstats(Relation relation,
pgcform->relallvisible = (int32) num_all_visible_pages;
dirty = true;
}
+ if (pgcform->relallfrozen != (int32) num_all_frozen_pages)
+ {
+ pgcform->relallfrozen = (int32) num_all_frozen_pages;
+ dirty = true;
+ }
/* Apply DDL updates, but not inside an outer transaction (see above) */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..fc149af 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_pages; /* # of pages we skipped due to the
+ all-frozen bit of the visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we may still skip scanning some pages
+ * according to the all-frozen bit of the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,14 +258,16 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_pages)
+ < vacrelstats->rel_pages)
{
- Assert(!scan_all);
scanned_all = false;
}
else
scanned_all = true;
+ scanned_all |= scan_all;
+
/*
* Optionally truncate the relation.
*
@@ -301,10 +308,16 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ /* true means count the all-visible bits */
+ new_rel_allvisible = visibilitymap_count(onerel, true);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ /* false means count the all-frozen bits */
+ new_rel_allfrozen = visibilitymap_count(onerel, false);
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -312,6 +325,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_pages,
new_rel_tuples,
new_rel_allvisible,
+ new_rel_allfrozen,
vacrelstats->hasindex,
new_frozen_xid,
new_min_multi,
@@ -515,7 +529,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -534,6 +549,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
hastup;
int prev_dead_count;
int nfrozen;
+ int already_nfrozen; /* # of tuples already frozen */
+ int ntup_blk; /* # of tuples in single page */
Size freespace;
bool all_visible_according_to_vm;
bool all_visible;
@@ -548,7 +565,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -566,9 +584,20 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
}
else
{
- /* Current block is all-visible */
+ /*
+ * Current block is all-visible.
+ * If the visibility map also says it is all-frozen, we can
+ * skip vacuuming this page unconditionally.
+ */
+ if (visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ vacrelstats->vmskipped_pages++;
+ continue;
+ }
+
if (skipping_all_visible_blocks && !scan_all)
continue;
+
all_visible_according_to_vm = true;
}
@@ -740,7 +769,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +794,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ already_nfrozen = 0;
+ ntup_blk = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +950,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_blk += 1;
hastup = true;
+ /* If current tuple is already frozen, count it up */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ already_nfrozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,11 +968,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
+ * We must log the changes to be crash-safe against future truncation
+ * of CLOG.
*/
- if (nfrozen > 0)
+ if (nfrozen > 0 || already_nfrozen > 0)
{
START_CRIT_SECTION();
@@ -953,8 +991,20 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
heap_execute_freeze_tuple(htup, &frozen[i]);
}
+ /*
+ * If, as a result of scanning the page, all tuples on it turn out to
+ * be completely frozen, set the VISIBILITYMAP_ALL_FROZEN bit in the
+ * visibility map and the PD_ALL_FROZEN flag on the page.
+ */
+ if (ntup_blk == (nfrozen + already_nfrozen))
+ {
+ PageSetAllFrozen(page);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer,
+ InvalidTransactionId, VISIBILITYMAP_ALL_FROZEN);
+ }
+
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (nfrozen > 0 && RelationNeedsWAL(onerel))
{
XLogRecPtr recptr;
@@ -1007,7 +1057,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
MarkBufferDirty(buf);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ vmbuffer, visibility_cutoff_xid, VISIBILITYMAP_ALL_VISIBLE);
}
/*
@@ -1018,11 +1068,11 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
- visibilitymap_clear(onerel, blkno, vmbuffer);
+ visibilitymap_clear(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
}
/*
@@ -1044,7 +1094,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
- visibilitymap_clear(onerel, blkno, vmbuffer);
+ visibilitymap_clear(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
}
UnlockReleaseBuffer(buf);
@@ -1078,7 +1128,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map and frozen map page.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1285,11 +1335,11 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* flag is now set, also set the VM bit.
*/
if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ !visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
Assert(BufferIsValid(*vmbuffer));
visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+ visibility_cutoff_xid, VISIBILITYMAP_ALL_VISIBLE);
}
return tupindex;
@@ -1408,6 +1458,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_pages,
stats->num_index_tuples,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 874ca6a..376841a 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -127,7 +127,7 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList)
if (attno != resultDesc->natts)
ereport(ERROR,
(errcode(ERRCODE_DATATYPE_MISMATCH),
- errmsg("table row type and query-specified row type do not match"),
+ errmsg("table row type and query-specified row type do not match"),
errdetail("Query has too few columns.")));
}
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..d2f083b 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -64,9 +64,10 @@
*/
/* PD_ALL_VISIBLE was cleared */
#define XLH_INSERT_ALL_VISIBLE_CLEARED (1<<0)
-#define XLH_INSERT_LAST_IN_MULTI (1<<1)
-#define XLH_INSERT_IS_SPECULATIVE (1<<2)
-#define XLH_INSERT_CONTAINS_NEW_TUPLE (1<<3)
+#define XLH_INSERT_ALL_FROZEN_CLEARED (1<<1)
+#define XLH_INSERT_LAST_IN_MULTI (1<<2)
+#define XLH_INSERT_IS_SPECULATIVE (1<<3)
+#define XLH_INSERT_CONTAINS_NEW_TUPLE (1<<4)
/*
* xl_heap_update flag values, 8 bits are available.
@@ -75,11 +76,15 @@
#define XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED (1<<0)
/* PD_ALL_VISIBLE was cleared in the 2nd page */
#define XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED (1<<1)
-#define XLH_UPDATE_CONTAINS_OLD_TUPLE (1<<2)
-#define XLH_UPDATE_CONTAINS_OLD_KEY (1<<3)
-#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<4)
-#define XLH_UPDATE_PREFIX_FROM_OLD (1<<5)
-#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<6)
+/* PD_ALL_FROZEN was cleared */
+#define XLH_UPDATE_OLD_ALL_FROZEN_CLEARED (1<<2)
+/* PD_ALL_FROZEN was cleared in the 2nd page */
+#define XLH_UPDATE_NEW_ALL_FROZEN_CLEARED (1<<3)
+#define XLH_UPDATE_CONTAINS_OLD_TUPLE (1<<4)
+#define XLH_UPDATE_CONTAINS_OLD_KEY (1<<5)
+#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<6)
+#define XLH_UPDATE_PREFIX_FROM_OLD (1<<7)
+#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<8)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_UPDATE_CONTAINS_OLD \
@@ -90,9 +95,10 @@
*/
/* PD_ALL_VISIBLE was cleared */
#define XLH_DELETE_ALL_VISIBLE_CLEARED (1<<0)
-#define XLH_DELETE_CONTAINS_OLD_TUPLE (1<<1)
-#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
-#define XLH_DELETE_IS_SUPER (1<<3)
+#define XLH_DELETE_ALL_FROZEN_CLEARED (1<<1)
+#define XLH_DELETE_CONTAINS_OLD_TUPLE (1<<2)
+#define XLH_DELETE_CONTAINS_OLD_KEY (1<<3)
+#define XLH_DELETE_IS_SUPER (1<<4)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
@@ -320,9 +326,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -382,6 +389,8 @@ extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
int ntuples);
+extern XLogRecPtr log_heap_frozenmap(RelFileNode rnode, Buffer heap_buffer,
+ Buffer fm_buffer);
extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
@@ -389,6 +398,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..53d8103 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,21 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
+ Buffer vmbuf, uint8 flags);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern BlockNumber visibilitymap_count(Relation rel, bool for_visible);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index e526cd9..ea0f7c1 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -95,7 +97,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 30
+#define Natts_pg_class 31
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -107,25 +109,26 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relispopulated 25
-#define Anum_pg_class_relreplident 26
-#define Anum_pg_class_relfrozenxid 27
-#define Anum_pg_class_relminmxid 28
-#define Anum_pg_class_relacl 29
-#define Anum_pg_class_reloptions 30
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relispopulated 26
+#define Anum_pg_class_relreplident 27
+#define Anum_pg_class_relfrozenxid 28
+#define Anum_pg_class_relminmxid 29
+#define Anum_pg_class_relacl 30
+#define Anum_pg_class_reloptions 31
/* ----------------
* initial contents of pg_class
@@ -140,13 +143,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e3a31af..d2bae2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -172,6 +172,7 @@ extern void vac_update_relstats(Relation relation,
BlockNumber num_pages,
double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..7bf2718 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -369,6 +371,13 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+#define PageClearAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags &= ~PD_ALL_FROZEN)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
On 30 April 2015 at 12:07, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
This patch lacks some points: documentation, comments in the source code,
etc., so it's still a WIP patch,
but I think it's enough to discuss this approach.
Code comments exist to indicate the intention of sections of code. They are
essential for reviewers, not a cosmetic thing to be added later. To gain
wide agreement we need wide understanding. (I recommend a development
approach where you write the comments first, then add code later.)
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jul 2, 2015 at 12:13 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Thu, May 28, 2015 at 11:34 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Thu, Apr 30, 2015 at 8:07 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Fri, Apr 24, 2015 at 11:21 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Fri, Apr 24, 2015 at 1:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 4/23/15 11:06 AM, Petr Jelinek wrote:
On 23/04/15 17:45, Bruce Momjian wrote:
On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote:
Agreed, no extra file, and the same write volume as currently. It would
also match pg_clog, which uses two bits per transaction --- maybe we can
reuse some of that code.

Yeah, this approach seems promising. We probably can't reuse code from
clog because the usage pattern is different (the key for clog is an xid, while
for the visibility/freeze map a ctid is used). But the visibility map storage
layer is pretty simple, so it should be easy to extend it for this use.

Actually, there may be some bit manipulation functions we could reuse;
things like efficiently counting how many bits in a byte are set. It probably
doesn't make sense to fully refactor it, but at least CLOG is a good source
for cut/paste/whack.
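For illustration, here is a minimal standalone sketch of that kind of per-byte
counting, assuming the two-bit layout the attached patch later adopts
(all-visible in the low bit of each pair, all-frozen in the high bit); the
function and table names are hypothetical, not from the patch:

#include <stdint.h>

/*
 * Count how many of the four heap blocks covered by one VM byte have the
 * all-frozen bit set.  A 256-entry table built from this is what makes
 * per-byte counting cheap, as the patch does with its
 * number_of_ones_for_visible / number_of_ones_for_freeze arrays.
 */
static uint8_t
count_frozen_in_byte(uint8_t b)
{
    uint8_t n = 0;

    for (int i = 0; i < 4; i++)         /* four heap blocks per VM byte */
        if (b & (0x02 << (2 * i)))      /* all-frozen flag of pair i */
            n++;
    return n;
}

/* Build the lookup table once; indexing it per byte is then O(1). */
static uint8_t frozen_count_table[256];

static void
init_frozen_count_table(void)
{
    for (int b = 0; b < 256; b++)
        frozen_count_table[b] = count_frozen_in_byte((uint8_t) b);
}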
I agree with adding a bit that indicates the corresponding page is
all-frozen into the VM, just like CLOG.
I'll change the patch into a second version.

The second patch is attached.
In the second patch, I added a bit into the visibility map that indicates all
tuples in a page are completely frozen.
The visibility map became a bitmap with two bits per heap page:
all-visible and all-frozen.
The logic around vacuum and heap insert/update/delete is almost the same as
in the previous version.

This patch lacks some points: documentation, comments in the source code,
etc., so it's still a WIP patch, but I think it's enough to discuss this
approach.

The previous patch no longer applies cleanly to HEAD.
The attached v2 patch is the latest version. Please review it.
Attached is a newly rebased version of the patch.
Please give me comments!
Now we should review your design and approach rather than the code,
but since I got an assertion error while trying the patch, I'm reporting it.
"initdb -D test -k" caused the following assertion failure.
vacuuming database template1 ... TRAP:
FailedAssertion("!((((PageHeader) (heapPage))->pd_flags & 0x0004))",
File: "visibilitymap.c", Line: 328)
sh: line 1: 83785 Abort trap: 6
"/dav/000_add_frozen_bit_into_visibilitymap_v3/bin/postgres" --single
-F -O -c search_path=pg_catalog -c exit_on_error=true template1 >
/dev/null
child process exited with exit code 134
initdb: removing data directory "test"
Regards,
--
Fujii Masao
On Thu, Jul 2, 2015 at 1:06 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
Thank you for the bug report and comments.
A fixed version is attached, and the source code comments are also updated.
Please review it.
Let me explain again here what this patch does and its current design.
- An additional bit in the visibility map.
I added an additional bit, the all-frozen bit, which indicates whether
all tuples of the corresponding page are frozen, to the visibility map.
This structure is similar to CLOG,
so the size of the VM grew to twice its previous size.
Also, each heap page header may have the PD_ALL_FROZEN flag set,
as well as all-visible (PD_ALL_VISIBLE).
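To make the layout concrete, here is a minimal standalone sketch of the
two-bit addressing this implies; the macro values mirror the attached
visibilitymap.c changes, while MAPSIZE is hard-coded here only for
illustration (the real value is derived from BLCKSZ):

#include <stdbool.h>
#include <stdint.h>

/* MAPSIZE stands in for the usable bytes on one VM page; in PostgreSQL it is
 * BLCKSZ - MAXALIGN(SizeOfPageHeaderData), which is 8168 with 8 kB blocks. */
#define MAPSIZE                    8168

/* Flag values, as defined by the patch in visibilitymap.h. */
#define VISIBILITYMAP_ALL_VISIBLE  0x01
#define VISIBILITYMAP_ALL_FROZEN   0x02

/* Two bits per heap block, so four heap blocks per VM byte. */
#define BITS_PER_HEAPBLOCK         2
#define HEAPBLOCKS_PER_BYTE        4
#define HEAPBLOCKS_PER_PAGE        (MAPSIZE * HEAPBLOCKS_PER_BYTE)

#define HEAPBLK_TO_MAPBLOCK(x)     ((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x)      (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x)       ((x) % HEAPBLOCKS_PER_BYTE)

/* Test whether the given flag(s) are set for heapBlk within its map byte. */
static bool
vm_byte_test(uint8_t mapbyte, uint32_t heapBlk, uint8_t flags)
{
    int shift = BITS_PER_HEAPBLOCK * HEAPBLK_TO_MAPBIT(heapBlk);

    return (mapbyte & (flags << shift)) != 0;
}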
- Setting and clearing the all-frozen bit
Update, delete and insert (including multi-insert) operations clear the bit
for that page, and clear the corresponding page header flag at the same time.
Only vacuum can set the bit, and only once all tuples of a page are frozen;
a sketch of that condition follows below.
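This is roughly the condition the patched lazy_scan_heap() checks before
setting VISIBILITYMAP_ALL_FROZEN and PD_ALL_FROZEN; the helper name and
standalone form here are illustrative only:

#include <stdbool.h>

/*
 * A page may be marked all-frozen when every tuple on it was either frozen
 * in this vacuum pass or was already frozen beforehand.
 */
static bool
page_becomes_all_frozen(int ntup_on_page, int nfrozen_now, int already_frozen)
{
    return ntup_on_page > 0 &&
           ntup_on_page == nfrozen_now + already_frozen;
}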
- Anti-wraparound vacuum
Today we have to scan the whole table for an anti-wraparound (XID) vacuum,
and that is quite expensive because of the disk I/O.
The main benefit of this proposal is to reduce or avoid that extremely large
amount of I/O even when an anti-wraparound vacuum is executed.
In the lazy_scan_heap() function, I added such logic experimentally;
a rough sketch of the skip decision is below.
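As a standalone illustration (not the patch's actual code), the per-page skip
decision could look like the following, where scan_all corresponds to an
anti-wraparound vacuum; the real lazy_scan_heap() additionally has to handle
SKIP_PAGES_THRESHOLD, buffer pins, and relfrozenxid accounting:

#include <stdbool.h>
#include <stdint.h>

#define VISIBILITYMAP_ALL_VISIBLE  0x01
#define VISIBILITYMAP_ALL_FROZEN   0x02
#define BITS_PER_HEAPBLOCK         2
#define HEAPBLOCKS_PER_BYTE        4

/*
 * Given the VM byte covering heapBlk, decide whether vacuum may skip the
 * page.  A normal vacuum can skip when the all-visible bit is set; with this
 * patch an anti-wraparound (scan_all) vacuum can also skip when the
 * all-frozen bit is set.
 */
static bool
vacuum_can_skip_page(uint8_t vm_byte, uint32_t heapBlk, bool scan_all)
{
    int     shift = BITS_PER_HEAPBLOCK * (heapBlk % HEAPBLOCKS_PER_BYTE);
    uint8_t bits  = (uint8_t) ((vm_byte >> shift) & 0x03);

    if (scan_all)
        return (bits & VISIBILITYMAP_ALL_FROZEN) != 0;
    return (bits & VISIBILITYMAP_ALL_VISIBLE) != 0;
}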
There were several other ideas in the previous discussion, such as a
read-only table and a separate frozen map. But the advantage of this
direction is that we don't need an additional heap file, and we can use the
mature VM mechanism.
Regards,
--
Sawada Masahiko
Attachments:
000_add_frozen_bit_into_visibilitymap_v4.patchtext/x-diff; charset=US-ASCII; name=000_add_frozen_bit_into_visibilitymap_v4.patchDownload
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 86a2e6b..835d714 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -88,7 +88,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool all_frozen_cleared, bool new_all_frozen_cleared);
static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
Bitmapset *hot_attrs,
Bitmapset *key_attrs, Bitmapset *id_attrs,
@@ -2107,7 +2108,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
HeapTuple heaptup;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
- bool all_visible_cleared = false;
+ bool all_visible_cleared = false,
+ all_frozen_cleared = false;
/*
* Fill in tuple header fields, assign an OID, and toast the tuple if
@@ -2131,8 +2133,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map
+ * page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -2150,7 +2153,16 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
PageClearAllVisible(BufferGetPage(buffer));
visibilitymap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
+ }
+
+ if (PageIsAllFrozen(BufferGetPage(buffer)))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(BufferGetPage(buffer));
+ visibilitymap_clear(relation,
+ ItemPointerGetBlockNumber(&(heaptup->t_self)),
+ vmbuffer, VISIBILITYMAP_ALL_FROZEN);
}
/*
@@ -2199,6 +2211,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
xlrec.flags = 0;
if (all_visible_cleared)
xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
+ if (all_frozen_cleared)
+ xlrec.flags |= XLH_INSERT_ALL_FROZEN_CLEARED;
if (options & HEAP_INSERT_SPECULATIVE)
xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
@@ -2406,7 +2420,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
{
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
- bool all_visible_cleared = false;
+ bool all_visible_cleared = false,
+ all_frozen_cleared = false;
int nthispage;
CHECK_FOR_INTERRUPTS();
@@ -2451,7 +2466,16 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
PageClearAllVisible(page);
visibilitymap_clear(relation,
BufferGetBlockNumber(buffer),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
+ }
+
+ if (PageIsAllFrozen(page))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(page);
+ visibilitymap_clear(relation,
+ BufferGetBlockNumber(buffer),
+ vmbuffer, VISIBILITYMAP_ALL_FROZEN);
}
/*
@@ -2496,6 +2520,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
tupledata = scratchptr;
xlrec->flags = all_visible_cleared ? XLH_INSERT_ALL_VISIBLE_CLEARED : 0;
+ if (all_frozen_cleared)
+ xlrec->flags |= XLH_INSERT_ALL_FROZEN_CLEARED;
xlrec->ntuples = nthispage;
/*
@@ -2698,7 +2724,8 @@ heap_delete(Relation relation, ItemPointer tid,
new_infomask2;
bool have_tuple_lock = false;
bool iscombo;
- bool all_visible_cleared = false;
+ bool all_visible_cleared = false,
+ all_frozen_cleared = false;
HeapTuple old_key_tuple = NULL; /* replica identity of the tuple */
bool old_key_copied = false;
@@ -2724,18 +2751,19 @@ heap_delete(Relation relation, ItemPointer tid,
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
- if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+ if (vmbuffer == InvalidBuffer &&
+ (PageIsAllVisible(page) || PageIsAllFrozen(page)))
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
visibilitymap_pin(relation, block, &vmbuffer);
@@ -2925,12 +2953,22 @@ l1:
*/
PageSetPrunable(page, xid);
+ /* clear PD_ALL_VISIBLE flags */
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
PageClearAllVisible(page);
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
+ }
+
+ /* clear PD_ALL_FROZEN flags */
+ if (PageIsAllFrozen(page))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(page);
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+ vmbuffer, VISIBILITYMAP_ALL_FROZEN);
}
/* store transaction information of xact deleting the tuple */
@@ -2962,6 +3000,8 @@ l1:
log_heap_new_cid(relation, &tp);
xlrec.flags = all_visible_cleared ? XLH_DELETE_ALL_VISIBLE_CLEARED : 0;
+ if (all_frozen_cleared)
+ xlrec.flags |= XLH_DELETE_ALL_FROZEN_CLEARED;
xlrec.infobits_set = compute_infobits(tp.t_data->t_infomask,
tp.t_data->t_infomask2);
xlrec.offnum = ItemPointerGetOffsetNumber(&tp.t_self);
@@ -3159,6 +3199,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool key_intact;
bool all_visible_cleared = false;
bool all_visible_cleared_new = false;
+ bool all_frozen_cleared = false;
+ bool all_frozen_cleared_new = false;
bool checked_lockers;
bool locker_remains;
TransactionId xmax_new_tuple,
@@ -3202,12 +3244,12 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
page = BufferGetPage(buffer);
/*
- * Before locking the buffer, pin the visibility map page if it appears to
- * be necessary. Since we haven't got the lock yet, someone else might be
+ * Before locking the buffer, pin the visibility map if it appears to be
+ * necessary. Since we haven't got the lock yet, someone else might be
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -3490,21 +3532,23 @@ l2:
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
+
bms_free(hot_attrs);
bms_free(key_attrs);
return result;
}
/*
- * If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, or during some
- * subsequent window during which we had it unlocked, we'll have to unlock
- * and re-lock, to avoid holding the buffer lock across an I/O. That's a
- * bit unfortunate, especially since we'll now have to recheck whether the
- * tuple has been locked or updated under us, but hopefully it won't
- * happen very often.
+ * If we didn't pin the visibility map page and the page has
+ * become all visible (or all frozen) while we were busy locking the buffer,
+ * or during some subsequent window during which we had it unlocked,
+ * we'll have to unlock and re-lock, to avoid holding the buffer lock
+ * across an I/O. That's a bit unfortunate, especially since we'll now
+ * have to recheck whether the tuple has been locked or updated under us,
+ * but hopefully it won't happen very often.
*/
- if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+ if (vmbuffer == InvalidBuffer &&
+ (PageIsAllVisible(page) || PageIsAllFrozen(page)))
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
visibilitymap_pin(relation, block, &vmbuffer);
@@ -3803,14 +3847,30 @@ l2:
all_visible_cleared = true;
PageClearAllVisible(BufferGetPage(buffer));
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
}
if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
{
all_visible_cleared_new = true;
PageClearAllVisible(BufferGetPage(newbuf));
visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
- vmbuffer_new);
+ vmbuffer_new, VISIBILITYMAP_ALL_VISIBLE);
+ }
+
+ /* clear PD_ALL_FROZEN flags */
+ if (PageIsAllFrozen(BufferGetPage(buffer)))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(BufferGetPage(buffer));
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+ vmbuffer, VISIBILITYMAP_ALL_FROZEN);
+ }
+ if (newbuf != buffer && PageIsAllFrozen(BufferGetPage(newbuf)))
+ {
+ all_frozen_cleared_new = true;
+ PageClearAllFrozen(BufferGetPage(newbuf));
+ visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
+ vmbuffer_new, VISIBILITYMAP_ALL_FROZEN);
}
if (newbuf != buffer)
@@ -3836,7 +3896,9 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ all_frozen_cleared,
+ all_frozen_cleared_new);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -6893,7 +6955,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -6903,6 +6965,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -6926,7 +6989,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool all_frozen_cleared, bool new_all_frozen_cleared)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -7009,6 +7073,10 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.flags |= XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED;
if (new_all_visible_cleared)
xlrec.flags |= XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (all_frozen_cleared)
+ xlrec.flags |= XLH_UPDATE_OLD_ALL_FROZEN_CLEARED;
+ if (new_all_frozen_cleared)
+ xlrec.flags |= XLH_UPDATE_NEW_ALL_FROZEN_CLEARED;
if (prefixlen > 0)
xlrec.flags |= XLH_UPDATE_PREFIX_FROM_OLD;
if (suffixlen > 0)
@@ -7492,8 +7560,14 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
+
}
else if (action == BLK_RESTORED)
{
@@ -7544,7 +7618,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
@@ -7656,13 +7730,20 @@ heap_xlog_delete(XLogReaderState *record)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLH_DELETE_ALL_VISIBLE_CLEARED | XLH_DELETE_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(target_node);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set flags to clear either one bit or both */
+ flags |= xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLH_DELETE_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -7734,13 +7815,20 @@ heap_xlog_insert(XLogReaderState *record)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLH_INSERT_ALL_VISIBLE_CLEARED | XLH_INSERT_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(target_node);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set flags to clear either one bit or both */
+ flags |= xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLH_INSERT_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -7800,6 +7888,9 @@ heap_xlog_insert(XLogReaderState *record)
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLH_INSERT_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
+
MarkBufferDirty(buffer);
}
if (BufferIsValid(buffer))
@@ -7854,13 +7945,20 @@ heap_xlog_multi_insert(XLogReaderState *record)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLH_INSERT_ALL_VISIBLE_CLEARED | XLH_INSERT_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(rnode);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set flags to clear either one bit or both */
+ flags |= xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLH_INSERT_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -7938,6 +8036,8 @@ heap_xlog_multi_insert(XLogReaderState *record)
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLH_INSERT_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
MarkBufferDirty(buffer);
}
@@ -8009,13 +8109,20 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED | XLH_UPDATE_OLD_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(rnode);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set flags to clear either one bit or both */
+ flags |= xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLH_UPDATE_OLD_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, oldblk, &vmbuffer);
- visibilitymap_clear(reln, oldblk, vmbuffer);
+ visibilitymap_clear(reln, oldblk, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8066,6 +8173,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLH_UPDATE_OLD_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
PageSetLSN(page, lsn);
MarkBufferDirty(obuffer);
@@ -8093,13 +8202,20 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED | XLH_UPDATE_NEW_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(rnode);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set flags to clear either one bit or both */
+ flags |= xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLH_UPDATE_NEW_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, newblk, &vmbuffer);
- visibilitymap_clear(reln, newblk, vmbuffer);
+ visibilitymap_clear(reln, newblk, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8201,6 +8317,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLH_UPDATE_NEW_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6db73bf..4e19f9c 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -327,7 +327,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
{
/* easy case */
buffer = ReadBufferBI(relation, targetBlock, bistate);
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(BufferGetPage(buffer)) ||
+ PageIsAllFrozen(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
@@ -335,7 +336,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
{
/* also easy case */
buffer = otherBuffer;
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(BufferGetPage(buffer)) ||
+ PageIsAllFrozen(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
@@ -343,7 +345,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
{
/* lock other buffer first */
buffer = ReadBuffer(relation, targetBlock);
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(BufferGetPage(buffer)) ||
+ PageIsAllFrozen(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -352,7 +355,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
{
/* lock target buffer first */
buffer = ReadBuffer(relation, targetBlock);
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(BufferGetPage(buffer)) ||
+ PageIsAllFrozen(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..cc2c274 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,11 +21,14 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, so the page doesn't need to be vacuumed even if whole
+ * table scanning vacuum is required. The map is conservative in the sense that
+ * we make sure that whenever a bit is set, we know the condition is true,
+ * but if a bit is not set, it might or might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
@@ -33,21 +36,25 @@
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or frozen
+ * to all transactions; we just don't know that for certain. The difficulty is
+ * that there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit. If a
+ * crash occurs after the visibility map page makes it to disk and before the
+ * updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
- * VACUUM will normally skip pages for which the visibility map bit is set;
+ * VACUUM will normally skip pages for which either visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
+ * The visibility map is not used for anti-wraparound vacuums before 9.5, because
* an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
* present in the table, even on pages that don't have any dead tuples.
+ * In 9.6 or later, the visibility map has an additional bit which indicates that all
+ * tuples on a single page have been completely frozen, so the visibility map is also
+ * used for anti-wraparound vacuums.
+ *
*
* LOCKING
*
@@ -58,14 +65,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -92,7 +99,7 @@
#include "utils/inval.h"
-/*#define TRACE_VISIBILITYMAP */
+#define TRACE_VISIBILITYMAP
/*
* Size of the bitmap on each visibility map page, in bytes. There's no
@@ -101,11 +108,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, other for all-frozen.
+*/
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +125,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set bits for visible and freeze */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_freeze[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,23 +169,23 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
- * any I/O.
+ * any I/O. The caller must pass flags indicating which bits to clear.
*/
void
-visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
+visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = flags << (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -225,7 +253,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bits on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +262,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and flags indicating which bits to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +274,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,7 +284,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -272,11 +302,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (!(map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +315,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +325,15 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(PageIsAllVisible(heapPage) || PageIsAllFrozen(heapPage));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +344,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test if bits are set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must pass
+ * flags indicating which bits to test.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -328,7 +363,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +372,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +395,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single or double bit read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ? true : false;
return result;
}
@@ -374,10 +409,12 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * If for_visible is true, we count the number of all-visible bits; if false,
+ * we count the number of all-frozen bits.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, bool for_visible)
{
BlockNumber result = 0;
BlockNumber mapBlock;
@@ -406,7 +443,8 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ result += for_visible ?
+ number_of_ones_for_visible[map[i]] : number_of_ones_for_freeze[map[i]];
}
ReleaseBuffer(mapBuffer);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 4246554..65753d9 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ {
+ relallvisible = visibilitymap_count(rel, true);
+ relallfrozen = visibilitymap_count(rel, false);
+ }
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
if (rd_rel->relpages != (int32) relpages)
{
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 861048f..1eaf2da 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -572,7 +572,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ visibilitymap_count(onerel, true),
+ visibilitymap_count(onerel, false),
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -595,6 +596,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ab4874..d3725dd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -22,6 +22,7 @@
#include "access/rewriteheap.h"
#include "access/transam.h"
#include "access/tuptoaster.h"
+#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index baf66f1..d68c7c4 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -744,6 +744,7 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
bool in_outer_xact)
@@ -781,6 +782,11 @@ vac_update_relstats(Relation relation,
pgcform->relallvisible = (int32) num_all_visible_pages;
dirty = true;
}
+ if (pgcform->relallfrozen != (int32) num_all_frozen_pages)
+ {
+ pgcform->relallfrozen = (int32) num_all_frozen_pages;
+ dirty = true;
+ }
/* Apply DDL updates, but not inside an outer transaction (see above) */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..e4e60eb 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_pages; /* # of pages we skipped by all-frozen bit
+ of visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we may still be able to skip scanning some pages
+ * according to the all-frozen bit of the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,14 +258,16 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_pages)
+ < vacrelstats->rel_pages)
{
- Assert(!scan_all);
scanned_all = false;
}
else
scanned_all = true;
+ scanned_all |= scan_all;
+
/*
* Optionally truncate the relation.
*
@@ -301,10 +308,16 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ /* true means count the all-visible bits */
+ new_rel_allvisible = visibilitymap_count(onerel, true);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ /* false means count the all-frozen bits */
+ new_rel_allfrozen = visibilitymap_count(onerel, false);
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -312,6 +325,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_pages,
new_rel_tuples,
new_rel_allvisible,
+ new_rel_allfrozen,
vacrelstats->hasindex,
new_frozen_xid,
new_min_multi,
@@ -486,9 +500,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of the
+ * visibility map means that we can't update relfrozenxid, so we only want to
+ * do it if we can skip a goodly number of pages. On the other hand, we count
+ * both how many pages we skipped according to the all-frozen bit of the
+ * visibility map and how many pages we froze, so we can still update
+ * relfrozenxid if the sum of those counts accounts for every page.
+ * XXX: We use only the all-visible bit to decide which pages to skip for now.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -515,7 +533,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -534,6 +553,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
hastup;
int prev_dead_count;
int nfrozen;
+ int already_nfrozen; /* # of tuples already frozen */
+ int ntup_in_blk; /* # of tuples in single page */
Size freespace;
bool all_visible_according_to_vm;
bool all_visible;
@@ -548,7 +569,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -566,9 +588,20 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
}
else
{
- /* Current block is all-visible */
+ /*
+ * Current block is all-visible.
+ * If the visibility map shows that it is also all-frozen, we can
+ * skip vacuuming the page unconditionally.
+ */
+ if (visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ vacrelstats->vmskipped_pages++;
+ continue;
+ }
+
if (skipping_all_visible_blocks && !scan_all)
continue;
+
all_visible_according_to_vm = true;
}
@@ -740,7 +773,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +798,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ already_nfrozen = 0;
+ ntup_in_blk = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +954,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_in_blk += 1;
hastup = true;
+ /* If current tuple is already frozen, count it up */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ already_nfrozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,11 +972,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
+ * We must log the changes to be crash-safe against future truncation
+ * of CLOG.
*/
- if (nfrozen > 0)
+ if (nfrozen > 0 || already_nfrozen > 0)
{
START_CRIT_SECTION();
@@ -953,8 +995,20 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
heap_execute_freeze_tuple(htup, &frozen[i]);
}
+ /*
+ * If, as a result of scanning the page, we find that all tuples
+ * are completely frozen, set the VISIBILITYMAP_ALL_FROZEN bit in the
+ * visibility map and the PD_ALL_FROZEN flag on the page.
+ */
+ if (ntup_in_blk == (nfrozen + already_nfrozen))
+ {
+ PageSetAllFrozen(page);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer,
+ InvalidTransactionId, VISIBILITYMAP_ALL_FROZEN);
+ }
+
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (nfrozen > 0 && RelationNeedsWAL(onerel))
{
XLogRecPtr recptr;
@@ -1007,7 +1061,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
MarkBufferDirty(buf);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ vmbuffer, visibility_cutoff_xid, VISIBILITYMAP_ALL_VISIBLE);
}
/*
@@ -1018,11 +1072,11 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
- visibilitymap_clear(onerel, blkno, vmbuffer);
+ visibilitymap_clear(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
}
/*
@@ -1044,7 +1098,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
- visibilitymap_clear(onerel, blkno, vmbuffer);
+ visibilitymap_clear(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
}
UnlockReleaseBuffer(buf);
@@ -1078,7 +1132,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map and frozen map page.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1285,11 +1339,11 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* flag is now set, also set the VM bit.
*/
if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ !visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
Assert(BufferIsValid(*vmbuffer));
visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+ visibility_cutoff_xid, VISIBILITYMAP_ALL_VISIBLE);
}
return tupindex;
@@ -1408,6 +1462,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_pages,
stats->num_index_tuples,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 874ca6a..376841a 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -127,7 +127,7 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList)
if (attno != resultDesc->natts)
ereport(ERROR,
(errcode(ERRCODE_DATATYPE_MISMATCH),
- errmsg("table row type and query-specified row type do not match"),
+ errmsg("table row type and query-specified row type do not match"),
errdetail("Query has too few columns.")));
}
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..d2f083b 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -64,9 +64,10 @@
*/
/* PD_ALL_VISIBLE was cleared */
#define XLH_INSERT_ALL_VISIBLE_CLEARED (1<<0)
-#define XLH_INSERT_LAST_IN_MULTI (1<<1)
-#define XLH_INSERT_IS_SPECULATIVE (1<<2)
-#define XLH_INSERT_CONTAINS_NEW_TUPLE (1<<3)
+#define XLH_INSERT_ALL_FROZEN_CLEARED (1<<1)
+#define XLH_INSERT_LAST_IN_MULTI (1<<2)
+#define XLH_INSERT_IS_SPECULATIVE (1<<3)
+#define XLH_INSERT_CONTAINS_NEW_TUPLE (1<<4)
/*
* xl_heap_update flag values, 8 bits are available.
@@ -75,11 +76,15 @@
#define XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED (1<<0)
/* PD_ALL_VISIBLE was cleared in the 2nd page */
#define XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED (1<<1)
-#define XLH_UPDATE_CONTAINS_OLD_TUPLE (1<<2)
-#define XLH_UPDATE_CONTAINS_OLD_KEY (1<<3)
-#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<4)
-#define XLH_UPDATE_PREFIX_FROM_OLD (1<<5)
-#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<6)
+/* PD_ALL_FROZEN was cleared */
+#define XLH_UPDATE_OLD_ALL_FROZEN_CLEARED (1<<2)
+/* PD_ALL_FROZEN was cleared in the 2nd page */
+#define XLH_UPDATE_NEW_ALL_FROZEN_CLEARED (1<<3)
+#define XLH_UPDATE_CONTAINS_OLD_TUPLE (1<<4)
+#define XLH_UPDATE_CONTAINS_OLD_KEY (1<<5)
+#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<6)
+#define XLH_UPDATE_PREFIX_FROM_OLD (1<<7)
+#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<8)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_UPDATE_CONTAINS_OLD \
@@ -90,9 +95,10 @@
*/
/* PD_ALL_VISIBLE was cleared */
#define XLH_DELETE_ALL_VISIBLE_CLEARED (1<<0)
-#define XLH_DELETE_CONTAINS_OLD_TUPLE (1<<1)
-#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
-#define XLH_DELETE_IS_SUPER (1<<3)
+#define XLH_DELETE_ALL_FROZEN_CLEARED (1<<1)
+#define XLH_DELETE_CONTAINS_OLD_TUPLE (1<<2)
+#define XLH_DELETE_CONTAINS_OLD_KEY (1<<3)
+#define XLH_DELETE_IS_SUPER (1<<4)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
@@ -320,9 +326,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -382,6 +389,8 @@ extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
int ntuples);
+extern XLogRecPtr log_heap_frozenmap(RelFileNode rnode, Buffer heap_buffer,
+ Buffer fm_buffer);
extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
@@ -389,6 +398,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..53d8103 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,21 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
+ Buffer vmbuf, uint8 flags);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern BlockNumber visibilitymap_count(Relation rel, bool for_visible);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index e526cd9..ea0f7c1 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -95,7 +97,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 30
+#define Natts_pg_class 31
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -107,25 +109,26 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relispopulated 25
-#define Anum_pg_class_relreplident 26
-#define Anum_pg_class_relfrozenxid 27
-#define Anum_pg_class_relminmxid 28
-#define Anum_pg_class_relacl 29
-#define Anum_pg_class_reloptions 30
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relispopulated 26
+#define Anum_pg_class_relreplident 27
+#define Anum_pg_class_relfrozenxid 28
+#define Anum_pg_class_relminmxid 29
+#define Anum_pg_class_relacl 30
+#define Anum_pg_class_reloptions 31
/* ----------------
* initial contents of pg_class
@@ -140,13 +143,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e3a31af..d2bae2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -172,6 +172,7 @@ extern void vac_update_relstats(Relation relation,
BlockNumber num_pages,
double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..7bf2718 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -369,6 +371,13 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+#define PageClearAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags &= ~PD_ALL_FROZEN)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
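To make the calling convention of the two-argument visibilitymap_count() above concrete, here is a minimal sketch (not part of the patch; the function name and DEBUG message are illustrative) of reading both counters and clamping them the way lazy_vacuum_rel() does in the patch:

/*
 * Sketch only: report both visibility map counters for a relation.
 * Assumes the two-argument visibilitymap_count() added by the patch above.
 */
#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

static void
report_vm_counters(Relation rel)
{
	BlockNumber relpages = RelationGetNumberOfBlocks(rel);
	BlockNumber allvisible = visibilitymap_count(rel, true);	/* all-visible bits */
	BlockNumber allfrozen = visibilitymap_count(rel, false);	/* all-frozen bits */

	/* Clamp, since the map can run ahead of a just-truncated heap. */
	if (allvisible > relpages)
		allvisible = relpages;
	if (allfrozen > relpages)
		allfrozen = relpages;

	elog(DEBUG1, "\"%s\": %u pages, %u all-visible, %u all-frozen",
		 RelationGetRelationName(rel), relpages, allvisible, allfrozen);
}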
On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
Also, the flags of each heap page header might be set PD_ALL_FROZEN,
as well as all-visible
Is it possible to have VM bits set to frozen but not visible?
The description makes those two states sound independent of each other.
Are they? Or not? Do we test for an impossible state?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jul 3, 2015 at 1:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
Also, the flags of each heap page header might be set PD_ALL_FROZEN,
as well as all-visible
Is it possible to have VM bits set to frozen but not visible?
The description makes those two states sound independent of each other.
Are they? Or not? Do we test for an impossible state?
It's impossible to have the VM bits set to frozen but not visible.
These bits are controlled independently, but whenever the
all-frozen bit is set, the all-visible bit is also set.
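In code terms, a minimal sketch of that invariant (not part of the attached patch; the helper name is illustrative, and the caller is assumed to hold an exclusive lock on the heap buffer and a pin on the right visibility map page):

/*
 * Sketch only: setting the all-frozen bit presumes the page is already
 * all-visible, so a vacuum-side helper could assert exactly that.
 */
#include "postgres.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

static void
mark_page_all_frozen(Relation rel, BlockNumber blkno,
					 Buffer buf, Buffer vmbuffer)
{
	Page	page = BufferGetPage(buf);

	/* All-frozen is only ever set on a page that is already all-visible. */
	Assert(PageIsAllVisible(page));
	Assert(visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE));

	PageSetAllFrozen(page);
	MarkBufferDirty(buf);
	visibilitymap_set(rel, blkno, buf, InvalidXLogRecPtr, vmbuffer,
					  InvalidTransactionId, VISIBILITYMAP_ALL_FROZEN);
}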
Regards,
--
Sawada Masahiko
On Fri, Jul 3, 2015 at 5:25 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Fri, Jul 3, 2015 at 1:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
Also, the flags of each heap page header might be set PD_ALL_FROZEN,
as well as all-visible
Is it possible to have VM bits set to frozen but not visible?
The description makes those two states sound independent of each other.
Are they? Or not? Do we test for an impossible state?
It's impossible to have the VM bits set to frozen but not visible.
These bits are controlled independently, but whenever the
all-frozen bit is set, the all-visible bit is also set.
Attached is the latest version, including some bug fixes.
Please review it.
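As a standalone illustration of the addressing arithmetic the attached version uses once the map holds two bits per heap block (the constants mirror the patch, but MAPSIZE is an assumed value here and this compiles on its own rather than inside the backend):

/* Sketch only: where heap block N lands in a two-bit-per-block visibility map. */
#include <stdio.h>

#define MAPSIZE					8168	/* assumed: BLCKSZ minus page header */
#define BITS_PER_HEAPBLOCK		2		/* all-visible + all-frozen */
#define HEAPBLOCKS_PER_BYTE		4
#define HEAPBLOCKS_PER_PAGE		(MAPSIZE * HEAPBLOCKS_PER_BYTE)

#define HEAPBLK_TO_MAPBLOCK(x)	((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x)	(((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x)	((x) % HEAPBLOCKS_PER_BYTE)

#define VM_ALL_VISIBLE	0x01
#define VM_ALL_FROZEN	0x02

int
main(void)
{
	unsigned	blkno = 1000003;
	unsigned	mask = (VM_ALL_VISIBLE | VM_ALL_FROZEN)
						<< (BITS_PER_HEAPBLOCK * HEAPBLK_TO_MAPBIT(blkno));

	printf("heap block %u -> map page %u, byte %u, bit pair %u, mask 0x%02x\n",
		   blkno,
		   HEAPBLK_TO_MAPBLOCK(blkno),
		   HEAPBLK_TO_MAPBYTE(blkno),
		   HEAPBLK_TO_MAPBIT(blkno),
		   mask);
	return 0;
}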
Regards,
--
Sawada Masahiko
Attachments:
000_add_frozen_bit_into_visibilitymap_v5.patchapplication/octet-stream; name=000_add_frozen_bit_into_visibilitymap_v5.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 86a2e6b..835d714 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -88,7 +88,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool all_frozen_cleared, bool new_all_frozen_cleared);
static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
Bitmapset *hot_attrs,
Bitmapset *key_attrs, Bitmapset *id_attrs,
@@ -2107,7 +2108,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
HeapTuple heaptup;
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
- bool all_visible_cleared = false;
+ bool all_visible_cleared = false,
+ all_frozen_cleared = false;
/*
* Fill in tuple header fields, assign an OID, and toast the tuple if
@@ -2131,8 +2133,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite page of the
+ * visibility map.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -2150,7 +2153,16 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
PageClearAllVisible(BufferGetPage(buffer));
visibilitymap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
+ }
+
+ if (PageIsAllFrozen(BufferGetPage(buffer)))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(BufferGetPage(buffer));
+ visibilitymap_clear(relation,
+ ItemPointerGetBlockNumber(&(heaptup->t_self)),
+ vmbuffer, VISIBILITYMAP_ALL_FROZEN);
}
/*
@@ -2199,6 +2211,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
xlrec.flags = 0;
if (all_visible_cleared)
xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
+ if (all_frozen_cleared)
+ xlrec.flags |= XLH_INSERT_ALL_FROZEN_CLEARED;
if (options & HEAP_INSERT_SPECULATIVE)
xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
@@ -2406,7 +2420,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
{
Buffer buffer;
Buffer vmbuffer = InvalidBuffer;
- bool all_visible_cleared = false;
+ bool all_visible_cleared = false,
+ all_frozen_cleared = false;
int nthispage;
CHECK_FOR_INTERRUPTS();
@@ -2451,7 +2466,16 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
PageClearAllVisible(page);
visibilitymap_clear(relation,
BufferGetBlockNumber(buffer),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
+ }
+
+ if (PageIsAllFrozen(page))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(page);
+ visibilitymap_clear(relation,
+ BufferGetBlockNumber(buffer),
+ vmbuffer, VISIBILITYMAP_ALL_FROZEN);
}
/*
@@ -2496,6 +2520,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
tupledata = scratchptr;
xlrec->flags = all_visible_cleared ? XLH_INSERT_ALL_VISIBLE_CLEARED : 0;
+ if (all_frozen_cleared)
+ xlrec->flags |= XLH_INSERT_ALL_FROZEN_CLEARED;
xlrec->ntuples = nthispage;
/*
@@ -2698,7 +2724,8 @@ heap_delete(Relation relation, ItemPointer tid,
new_infomask2;
bool have_tuple_lock = false;
bool iscombo;
- bool all_visible_cleared = false;
+ bool all_visible_cleared = false,
+ all_frozen_cleared = false;
HeapTuple old_key_tuple = NULL; /* replica identity of the tuple */
bool old_key_copied = false;
@@ -2724,18 +2751,19 @@ heap_delete(Relation relation, ItemPointer tid,
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
- if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+ if (vmbuffer == InvalidBuffer &&
+ (PageIsAllVisible(page) || PageIsAllFrozen(page)))
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
visibilitymap_pin(relation, block, &vmbuffer);
@@ -2925,12 +2953,22 @@ l1:
*/
PageSetPrunable(page, xid);
+ /* clear PD_ALL_VISIBLE flags */
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
PageClearAllVisible(page);
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
+ }
+
+ /* clear PD_ALL_FROZEN flags */
+ if (PageIsAllFrozen(page))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(page);
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+ vmbuffer, VISIBILITYMAP_ALL_FROZEN);
}
/* store transaction information of xact deleting the tuple */
@@ -2962,6 +3000,8 @@ l1:
log_heap_new_cid(relation, &tp);
xlrec.flags = all_visible_cleared ? XLH_DELETE_ALL_VISIBLE_CLEARED : 0;
+ if (all_frozen_cleared)
+ xlrec.flags |= XLH_DELETE_ALL_FROZEN_CLEARED;
xlrec.infobits_set = compute_infobits(tp.t_data->t_infomask,
tp.t_data->t_infomask2);
xlrec.offnum = ItemPointerGetOffsetNumber(&tp.t_self);
@@ -3159,6 +3199,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
bool key_intact;
bool all_visible_cleared = false;
bool all_visible_cleared_new = false;
+ bool all_frozen_cleared = false;
+ bool all_frozen_cleared_new = false;
bool checked_lockers;
bool locker_remains;
TransactionId xmax_new_tuple,
@@ -3202,12 +3244,12 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
page = BufferGetPage(buffer);
/*
- * Before locking the buffer, pin the visibility map page if it appears to
- * be necessary. Since we haven't got the lock yet, someone else might be
+ * Before locking the buffer, pin the visibility map if it appears to be
+ * necessary. Since we haven't got the lock yet, someone else might be
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -3490,21 +3532,23 @@ l2:
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
+
bms_free(hot_attrs);
bms_free(key_attrs);
return result;
}
/*
- * If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, or during some
- * subsequent window during which we had it unlocked, we'll have to unlock
- * and re-lock, to avoid holding the buffer lock across an I/O. That's a
- * bit unfortunate, especially since we'll now have to recheck whether the
- * tuple has been locked or updated under us, but hopefully it won't
- * happen very often.
+ * If we didn't pin the visibility map page and the page has
+ * become all visible (or all frozen) while we were busy locking the buffer,
+ * or during some subsequent window during which we had it unlocked,
+ * we'll have to unlock and re-lock, to avoid holding the buffer lock
+ * across an I/O. That's a bit unfortunate, especially since we'll now
+ * have to recheck whether the tuple has been locked or updated under us,
+ * but hopefully it won't happen very often.
*/
- if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+ if (vmbuffer == InvalidBuffer &&
+ (PageIsAllVisible(page) || PageIsAllFrozen(page)))
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
visibilitymap_pin(relation, block, &vmbuffer);
@@ -3803,14 +3847,30 @@ l2:
all_visible_cleared = true;
PageClearAllVisible(BufferGetPage(buffer));
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
}
if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
{
all_visible_cleared_new = true;
PageClearAllVisible(BufferGetPage(newbuf));
visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
- vmbuffer_new);
+ vmbuffer_new, VISIBILITYMAP_ALL_VISIBLE);
+ }
+
+ /* clear PD_ALL_FROZEN flags */
+ if (PageIsAllFrozen(BufferGetPage(buffer)))
+ {
+ all_frozen_cleared = true;
+ PageClearAllFrozen(BufferGetPage(buffer));
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+ vmbuffer, VISIBILITYMAP_ALL_FROZEN);
+ }
+ if (newbuf != buffer && PageIsAllFrozen(BufferGetPage(newbuf)))
+ {
+ all_frozen_cleared_new = true;
+ PageClearAllFrozen(BufferGetPage(newbuf));
+ visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
+ vmbuffer_new, VISIBILITYMAP_ALL_FROZEN);
}
if (newbuf != buffer)
@@ -3836,7 +3896,9 @@ l2:
newbuf, &oldtup, heaptup,
old_key_tuple,
all_visible_cleared,
- all_visible_cleared_new);
+ all_visible_cleared_new,
+ all_frozen_cleared,
+ all_frozen_cleared_new);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -6893,7 +6955,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -6903,6 +6965,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -6926,7 +6989,8 @@ static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
HeapTuple old_key_tuple,
- bool all_visible_cleared, bool new_all_visible_cleared)
+ bool all_visible_cleared, bool new_all_visible_cleared,
+ bool all_frozen_cleared, bool new_all_frozen_cleared)
{
xl_heap_update xlrec;
xl_heap_header xlhdr;
@@ -7009,6 +7073,10 @@ log_heap_update(Relation reln, Buffer oldbuf,
xlrec.flags |= XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED;
if (new_all_visible_cleared)
xlrec.flags |= XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (all_frozen_cleared)
+ xlrec.flags |= XLH_UPDATE_OLD_ALL_FROZEN_CLEARED;
+ if (new_all_frozen_cleared)
+ xlrec.flags |= XLH_UPDATE_NEW_ALL_FROZEN_CLEARED;
if (prefixlen > 0)
xlrec.flags |= XLH_UPDATE_PREFIX_FROM_OLD;
if (suffixlen > 0)
@@ -7492,8 +7560,14 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
+
}
else if (action == BLK_RESTORED)
{
@@ -7544,7 +7618,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
@@ -7656,13 +7730,20 @@ heap_xlog_delete(XLogReaderState *record)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLH_DELETE_ALL_VISIBLE_CLEARED | XLH_DELETE_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(target_node);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set flags for clearing either bit or both */
+ flags |= xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLH_DELETE_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -7734,13 +7815,20 @@ heap_xlog_insert(XLogReaderState *record)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLH_INSERT_ALL_VISIBLE_CLEARED | XLH_INSERT_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(target_node);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set flags for clearing either bit or both */
+ flags |= xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLH_INSERT_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -7800,6 +7888,9 @@ heap_xlog_insert(XLogReaderState *record)
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLH_INSERT_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
+
MarkBufferDirty(buffer);
}
if (BufferIsValid(buffer))
@@ -7854,13 +7945,20 @@ heap_xlog_multi_insert(XLogReaderState *record)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLH_INSERT_ALL_VISIBLE_CLEARED | XLH_INSERT_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(rnode);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set flags for clearing either bit or both */
+ flags |= xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLH_INSERT_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
+ visibilitymap_clear(reln, blkno, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -7938,6 +8036,8 @@ heap_xlog_multi_insert(XLogReaderState *record)
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLH_INSERT_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
MarkBufferDirty(buffer);
}
@@ -8009,13 +8109,20 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED | XLH_UPDATE_OLD_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(rnode);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set flags for clearing either bit or both */
+ flags |= xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLH_UPDATE_OLD_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, oldblk, &vmbuffer);
- visibilitymap_clear(reln, oldblk, vmbuffer);
+ visibilitymap_clear(reln, oldblk, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8066,6 +8173,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLH_UPDATE_OLD_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
PageSetLSN(page, lsn);
MarkBufferDirty(obuffer);
@@ -8093,13 +8202,20 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
+ if (xlrec->flags & (XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED | XLH_UPDATE_NEW_ALL_FROZEN_CLEARED))
{
Relation reln = CreateFakeRelcacheEntry(rnode);
Buffer vmbuffer = InvalidBuffer;
+ uint8 flags = 0;
+
+ /* set flags for clearing either bit or both */
+ flags |= xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED ?
+ VISIBILITYMAP_ALL_VISIBLE : 0;
+ flags |= xlrec->flags & XLH_UPDATE_NEW_ALL_FROZEN_CLEARED ?
+ VISIBILITYMAP_ALL_FROZEN : 0;
visibilitymap_pin(reln, newblk, &vmbuffer);
- visibilitymap_clear(reln, newblk, vmbuffer);
+ visibilitymap_clear(reln, newblk, vmbuffer, flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8201,6 +8317,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
+ if (xlrec->flags & XLH_UPDATE_NEW_ALL_FROZEN_CLEARED)
+ PageClearAllFrozen(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6db73bf..4e19f9c 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -327,7 +327,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
{
/* easy case */
buffer = ReadBufferBI(relation, targetBlock, bistate);
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(BufferGetPage(buffer)) ||
+ PageIsAllFrozen(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
@@ -335,7 +336,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
{
/* also easy case */
buffer = otherBuffer;
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(BufferGetPage(buffer)) ||
+ PageIsAllFrozen(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
@@ -343,7 +345,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
{
/* lock other buffer first */
buffer = ReadBuffer(relation, targetBlock);
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(BufferGetPage(buffer)) ||
+ PageIsAllFrozen(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -352,7 +355,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
{
/* lock target buffer first */
buffer = ReadBuffer(relation, targetBlock);
- if (PageIsAllVisible(BufferGetPage(buffer)))
+ if (PageIsAllVisible(BufferGetPage(buffer)) ||
+ PageIsAllFrozen(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..cc2c274 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,11 +21,14 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, so the page doesn't need to be vacuumed even when a
+ * whole-table-scanning vacuum is required. The map is conservative in the sense that
+ * we make sure that whenever a bit is set, we know the condition is true,
+ * but if a bit is not set, it might or might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
@@ -33,21 +36,25 @@
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or frozen
+ * to all transactions; we just don't know that for certain. The difficulty is
+ * that there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit. If a
+ * crash occurs after the visibility map page makes it to disk and before the
+ * updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
- * VACUUM will normally skip pages for which the visibility map bit is set;
+ * VACUUM will normally skip pages for which either visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
+ * The visibility map is not used for anti-wraparound vacuums before 9.6, because
* an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
* present in the table, even on pages that don't have any dead tuples.
+ * In 9.6 and later, the visibility map has an additional bit which indicates that
+ * all tuples on a page are completely frozen, so the visibility map is also used
+ * for anti-wraparound vacuums.
+ *
*
* LOCKING
*
@@ -58,14 +65,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -92,7 +99,7 @@
#include "utils/inval.h"
-/*#define TRACE_VISIBILITYMAP */
+#define TRACE_VISIBILITYMAP
/*
* Size of the bitmap on each visibility map page, in bytes. There's no
@@ -101,11 +108,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +125,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set bits for visible and freeze */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_freeze[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,23 +169,23 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
- * any I/O.
+ * any I/O. The caller must pass flags indicating which bits to clear.
*/
void
-visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
+visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = flags << (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -225,7 +253,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bits on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +262,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and flags indicating which bits to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +274,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,7 +284,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -272,11 +302,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (!(map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +315,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +325,15 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(PageIsAllVisible(heapPage) || PageIsAllFrozen(heapPage));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +344,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test if bits are set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must pass flags
+ * indicating which bits to test.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -328,7 +363,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +372,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +395,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single or double bit read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ? true : false;
return result;
}
@@ -374,10 +409,12 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * if for_visible is true, we count the number of all-visible flag. If false,
+ * we count the number of all-frozen flag.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, bool for_visible)
{
BlockNumber result = 0;
BlockNumber mapBlock;
@@ -406,7 +443,8 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ result += for_visible ?
+ number_of_ones_for_visible[map[i]] : number_of_ones_for_freeze[map[i]];
}
ReleaseBuffer(mapBuffer);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 4246554..65753d9 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ {
+ relallvisible = visibilitymap_count(rel, true);
+ relallfrozen = visibilitymap_count(rel, false);
+ }
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
if (rd_rel->relpages != (int32) relpages)
{
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 861048f..1eaf2da 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -572,7 +572,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ visibilitymap_count(onerel, true),
+ visibilitymap_count(onerel, false),
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -595,6 +596,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ab4874..d3725dd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -22,6 +22,7 @@
#include "access/rewriteheap.h"
#include "access/transam.h"
#include "access/tuptoaster.h"
+#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index baf66f1..d68c7c4 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -744,6 +744,7 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
bool in_outer_xact)
@@ -781,6 +782,11 @@ vac_update_relstats(Relation relation,
pgcform->relallvisible = (int32) num_all_visible_pages;
dirty = true;
}
+ if (pgcform->relallfrozen != (int32) num_all_frozen_pages)
+ {
+ pgcform->relallfrozen = (int32) num_all_frozen_pages;
+ dirty = true;
+ }
/* Apply DDL updates, but not inside an outer transaction (see above) */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..e4e60eb 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_pages; /* # of pages we skipped by all-frozen bit
+ of visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set so far, we could skip to scan some pages
+ * according by frozen map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,14 +258,16 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_pages)
+ < vacrelstats->rel_pages)
{
- Assert(!scan_all);
scanned_all = false;
}
else
scanned_all = true;
+ scanned_all |= scan_all;
+
/*
* Optionally truncate the relation.
*
@@ -301,10 +308,16 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ /* true means that count for all-visible */
+ new_rel_allvisible = visibilitymap_count(onerel, true);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ /* false means that count for all-frozen */
+ new_rel_allfrozen = visibilitymap_count(onerel, false);
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -312,6 +325,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_pages,
new_rel_tuples,
new_rel_allvisible,
+ new_rel_allfrozen,
vacrelstats->hasindex,
new_frozen_xid,
new_min_multi,
@@ -486,9 +500,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page accorind to all-visible bit of
+ * visibility map means that we can't update relfrozenxid, so we only want
+ * to do it if we can skip a goodly number. On the other hand, we count
+ * both how many pages we skipped according to all-frozen bit of visibility
+ * map and how many pages we freeze page, so we can update relfrozenxid if
+ * the sum of their is as many as tuples per page.
+ * XXX : We use only all-visible bit to determine skip page for now.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -515,7 +533,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -534,6 +553,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
hastup;
int prev_dead_count;
int nfrozen;
+ int already_nfrozen; /* # of tuples already frozen */
+ int ntup_in_blk; /* # of tuples in single page */
Size freespace;
bool all_visible_according_to_vm;
bool all_visible;
@@ -548,7 +569,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -566,9 +588,20 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
}
else
{
- /* Current block is all-visible */
+ /*
+ * Current block is all-visible.
+ * If visibility map represents that it's all frozen, we can
+ * skip to vacuum page unconditionally.
+ */
+ if (visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ vacrelstats->vmskipped_pages++;
+ continue;
+ }
+
if (skipping_all_visible_blocks && !scan_all)
continue;
+
all_visible_according_to_vm = true;
}
@@ -740,7 +773,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +798,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ already_nfrozen = 0;
+ ntup_in_blk = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +954,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_in_blk += 1;
hastup = true;
+ /* If current tuple is already frozen, count it up */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ already_nfrozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,11 +972,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
+ * We must log the changes to be crash-safe against future truncation
+ * of CLOG.
*/
- if (nfrozen > 0)
+ if (nfrozen > 0 || already_nfrozen > 0)
{
START_CRIT_SECTION();
@@ -953,8 +995,20 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
heap_execute_freeze_tuple(htup, &frozen[i]);
}
+ /*
+ * As a result of scanning a page, we ensure that all tuples
+ * are completely frozen. Set VISIBILITYMAP_ALL_FROZEN bit on
+ * visibility map and PD_ALL_FROZEN flag on page.
+ */
+ if (ntup_in_blk == (nfrozen + already_nfrozen))
+ {
+ PageSetAllFrozen(page);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer,
+ InvalidTransactionId, VISIBILITYMAP_ALL_FROZEN);
+ }
+
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(onerel))
+ if (nfrozen > 0 && RelationNeedsWAL(onerel))
{
XLogRecPtr recptr;
@@ -1007,7 +1061,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
MarkBufferDirty(buf);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ vmbuffer, visibility_cutoff_xid, VISIBILITYMAP_ALL_VISIBLE);
}
/*
@@ -1018,11 +1072,11 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
- visibilitymap_clear(onerel, blkno, vmbuffer);
+ visibilitymap_clear(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
}
/*
@@ -1044,7 +1098,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
- visibilitymap_clear(onerel, blkno, vmbuffer);
+ visibilitymap_clear(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE);
}
UnlockReleaseBuffer(buf);
@@ -1078,7 +1132,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map and frozen map page.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1285,11 +1339,11 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* flag is now set, also set the VM bit.
*/
if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ !visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
Assert(BufferIsValid(*vmbuffer));
visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+ visibility_cutoff_xid, VISIBILITYMAP_ALL_VISIBLE);
}
return tupindex;
@@ -1408,6 +1462,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_pages,
stats->num_index_tuples,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 874ca6a..376841a 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -127,7 +127,7 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList)
if (attno != resultDesc->natts)
ereport(ERROR,
(errcode(ERRCODE_DATATYPE_MISMATCH),
- errmsg("table row type and query-specified row type do not match"),
+ errmsg("table row type and query-specified row type do not match"),
errdetail("Query has too few columns.")));
}
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..779afd8 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -64,9 +64,10 @@
*/
/* PD_ALL_VISIBLE was cleared */
#define XLH_INSERT_ALL_VISIBLE_CLEARED (1<<0)
-#define XLH_INSERT_LAST_IN_MULTI (1<<1)
-#define XLH_INSERT_IS_SPECULATIVE (1<<2)
-#define XLH_INSERT_CONTAINS_NEW_TUPLE (1<<3)
+#define XLH_INSERT_ALL_FROZEN_CLEARED (1<<1)
+#define XLH_INSERT_LAST_IN_MULTI (1<<2)
+#define XLH_INSERT_IS_SPECULATIVE (1<<3)
+#define XLH_INSERT_CONTAINS_NEW_TUPLE (1<<4)
/*
* xl_heap_update flag values, 8 bits are available.
@@ -75,11 +76,15 @@
#define XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED (1<<0)
/* PD_ALL_VISIBLE was cleared in the 2nd page */
#define XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED (1<<1)
-#define XLH_UPDATE_CONTAINS_OLD_TUPLE (1<<2)
-#define XLH_UPDATE_CONTAINS_OLD_KEY (1<<3)
-#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<4)
-#define XLH_UPDATE_PREFIX_FROM_OLD (1<<5)
-#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<6)
+/* PD_FROZEN_VISIBLE was cleared */
+#define XLH_UPDATE_OLD_ALL_FROZEN_CLEARED (1<<2)
+/* PD_FROZEN_VISIBLE was cleared in the 2nd page */
+#define XLH_UPDATE_NEW_ALL_FROZEN_CLEARED (1<<3)
+#define XLH_UPDATE_CONTAINS_OLD_TUPLE (1<<4)
+#define XLH_UPDATE_CONTAINS_OLD_KEY (1<<5)
+#define XLH_UPDATE_CONTAINS_NEW_TUPLE (1<<6)
+#define XLH_UPDATE_PREFIX_FROM_OLD (1<<7)
+#define XLH_UPDATE_SUFFIX_FROM_OLD (1<<8)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_UPDATE_CONTAINS_OLD \
@@ -90,9 +95,10 @@
*/
/* PD_ALL_VISIBLE was cleared */
#define XLH_DELETE_ALL_VISIBLE_CLEARED (1<<0)
-#define XLH_DELETE_CONTAINS_OLD_TUPLE (1<<1)
-#define XLH_DELETE_CONTAINS_OLD_KEY (1<<2)
-#define XLH_DELETE_IS_SUPER (1<<3)
+#define XLH_DELETE_ALL_FROZEN_CLEARED (1<<1)
+#define XLH_DELETE_CONTAINS_OLD_TUPLE (1<<2)
+#define XLH_DELETE_CONTAINS_OLD_KEY (1<<3)
+#define XLH_DELETE_IS_SUPER (1<<4)
/* convenience macro for checking whether any form of old tuple was logged */
#define XLH_DELETE_CONTAINS_OLD \
@@ -190,7 +196,7 @@ typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
OffsetNumber old_offnum; /* old tuple's offset */
uint8 old_infobits_set; /* infomask bits to set on old tuple */
- uint8 flags;
+ uint16 flags;
TransactionId new_xmax; /* xmax of the new tuple */
OffsetNumber new_offnum; /* new tuple's offset */
@@ -320,9 +326,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -382,6 +389,8 @@ extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
int ntuples);
+extern XLogRecPtr log_heap_frozenmap(RelFileNode rnode, Buffer heap_buffer,
+ Buffer fm_buffer);
extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
@@ -389,6 +398,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..53d8103 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,21 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
+ Buffer vmbuf, uint8 flags);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern BlockNumber visibilitymap_count(Relation rel, bool for_visible);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index e526cd9..ea0f7c1 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -95,7 +97,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 30
+#define Natts_pg_class 31
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -107,25 +109,26 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relispopulated 25
-#define Anum_pg_class_relreplident 26
-#define Anum_pg_class_relfrozenxid 27
-#define Anum_pg_class_relminmxid 28
-#define Anum_pg_class_relacl 29
-#define Anum_pg_class_reloptions 30
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relispopulated 26
+#define Anum_pg_class_relreplident 27
+#define Anum_pg_class_relfrozenxid 28
+#define Anum_pg_class_relminmxid 29
+#define Anum_pg_class_relacl 30
+#define Anum_pg_class_reloptions 31
/* ----------------
* initial contents of pg_class
@@ -140,13 +143,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e3a31af..d2bae2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -172,6 +172,7 @@ extern void vac_update_relstats(Relation relation,
BlockNumber num_pages,
double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..7bf2718 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -369,6 +371,13 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+#define PageClearAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags &= ~PD_ALL_FROZEN)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
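To make the flag arithmetic in the visibilitymap_set()/visibilitymap_test() hunks above easier to follow, here is a minimal stand-alone sketch of the two-bits-per-heap-page encoding. It assumes BITS_PER_HEAPBLOCK is 2; ALL_VISIBLE, ALL_FROZEN, map_set and map_test are simplified stand-ins rather than the real PostgreSQL symbols.

#include <stdint.h>
#include <stdio.h>

#define ALL_VISIBLE 0x01
#define ALL_FROZEN  0x02
#define BITS_PER_HEAPBLOCK 2

/* Set the requested flag bits for one heap block within one map byte. */
static void
map_set(uint8_t *mapbyte, int blk_in_byte, uint8_t flags)
{
    *mapbyte |= (uint8_t) (flags << (BITS_PER_HEAPBLOCK * blk_in_byte));
}

/* Like the patch's visibilitymap_test(): true if any requested bit is set. */
static int
map_test(uint8_t mapbyte, int blk_in_byte, uint8_t flags)
{
    return (mapbyte & (flags << (BITS_PER_HEAPBLOCK * blk_in_byte))) != 0;
}

int
main(void)
{
    uint8_t byte = 0;

    map_set(&byte, 2, ALL_VISIBLE);                /* heap block 2 becomes all-visible */
    map_set(&byte, 2, ALL_VISIBLE | ALL_FROZEN);   /* later it also becomes all-frozen */

    printf("visible=%d frozen=%d byte=0x%02x\n",
           map_test(byte, 2, ALL_VISIBLE),
           map_test(byte, 2, ALL_FROZEN),
           byte);                                  /* prints: visible=1 frozen=1 byte=0x30 */
    return 0;
}

For heap block 2 within a map byte the visible bit lands at 0x10 and the frozen bit at 0x20, so a fully frozen page reads back as 0x30.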
On 3 July 2015 at 09:25, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Fri, Jul 3, 2015 at 1:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
Also, the flags of each heap page header might be set PD_ALL_FROZEN,
as well as all-visible
Is it possible to have VM bits set to frozen but not visible?
The description makes those two states sound independent of each other.
Are they? Or not? Do we test for an impossible state?
It's impossible to have VM bits set to frozen but not visible.
These bits are controlled independently. But eventually, when the
all-frozen bit is set, all-visible is also set.
And my understanding is that if you clear all-visible you would also clear
all-frozen...
So I don't understand why you have two separate calls to
visibilitymap_clear()
Surely the logic should be to clear both bits at the same time?
In my understanding the state logic is
1. Both bits unset ~(VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN)
which can be changed to state 2 only
2. VISIBILITYMAP_ALL_VISIBLE only
which can be changed state 1 or state 3
3. VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN
which can be changed to state 1 only
If that is the case please simplify the logic for setting and unsetting the
bits so they are set together efficiently. At the same time please also put
in Asserts to ensure that the state logic is maintained when it is set and
when it is tested.
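A minimal sketch of that state logic, assuming the patch's flag values (0x01 for all-visible, 0x02 for all-frozen); the helper names are hypothetical and this is illustration only, not the patch:

#include <assert.h>
#include <stdint.h>

#define ALL_VISIBLE 0x01
#define ALL_FROZEN  0x02

/* Setting bits: all-frozen must never end up set without all-visible. */
static uint8_t
vm_set_bits(uint8_t cur, uint8_t flags)
{
    uint8_t next = cur | flags;

    assert((next & ALL_FROZEN) == 0 || (next & ALL_VISIBLE) != 0);
    return next;
}

/* Clearing: any modification of the page drops both bits together. */
static uint8_t
vm_clear_bits(uint8_t cur)
{
    return cur & (uint8_t) ~(ALL_VISIBLE | ALL_FROZEN);
}

int
main(void)
{
    uint8_t bits = 0;

    bits = vm_set_bits(bits, ALL_VISIBLE);               /* state 1 -> 2 */
    bits = vm_set_bits(bits, ALL_VISIBLE | ALL_FROZEN);  /* state 2 -> 3 */
    bits = vm_clear_bits(bits);                          /* state 3 -> 1 */
    /* vm_set_bits(0, ALL_FROZEN) would trip the assert: frozen but not visible */
    return (int) bits;
}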
I would also like to see the visibilitymap_test function exposed in SQL, so
we can write code to examine the map contents for particular ctids. By
doing that we can then write a formal test that shows the evolution of
tuples from insertion, vacuuming and freezing, testing the map has been set
correctly at each stage. I guess that needs to be done as an isolation test
so we have an observer that constrains the xmin in various ways. In light of
multixact bugs, any code that changes the on-disk tuple metadata needs
formal tests.
Other than that the overall concept seems sound.
I think we need something for pg_upgrade to rewrite existing VMs. Otherwise
a large read only database would suddenly require a massive revacuum after
upgrade, which seems bad. That can wait for now until we all agree this
patch is sound.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jul 3, 2015 at 1:55 PM, Sawada Masahiko <sawada.mshk@gmail.com>
wrote:
On Fri, Jul 3, 2015 at 1:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
Also, the flags of each heap page header might be set PD_ALL_FROZEN,
as well as all-visible
Is it possible to have VM bits set to frozen but not visible?
The description makes those two states sound independent of each other.
Are they? Or not? Do we test for an impossible state?
It's impossible to have VM bits set to frozen but not visible.
In patch, during Vacuum first the frozen bit is set and then the visibility
will be set in a later operation, now if the crash happens between those
2 operations, then isn't it possible that the frozen bit is set and visible
bit is not set?
These bits are controlled independently. But eventually, when the
all-frozen bit is set, all-visible is also set.
Yes, during normal operations it will happen that way, but I think there
are corner cases where that assumption is not true.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 2, 2015 at 9:00 PM, Sawada Masahiko <sawada.mshk@gmail.com>
wrote:
Thank you for bug report, and comments.
Fixed version is attached, and source code comment is also updated.
Please review it.
I am looking into this patch and would like to share my findings with
you:
1.
@@ -2131,8 +2133,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * of all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
*/
typo in comments.
/of all frozen/or all frozen
2.
visibilitymap.c
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen
+ * per heap page.
/and all-frozen/and all-frozen)
closing round bracket is missing.
3.
visibilitymap.c
-/*#define TRACE_VISIBILITYMAP */
+#define TRACE_VISIBILITYMAP
why is this hash define opened?
4.
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, bool for_visible)
This API needs to count set bits for either visibility info, frozen info
or both (if required), it seems better to have second parameter as
uint8 flags rather than bool. Also, if it is required to be called at most
places for both visibility and frozen bits count, why not get them
in one call?
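A hypothetical sketch of what such an API could compute, counting both kinds of bits in a single pass over a flat array of map bytes; the real function works on visibility map buffers, and all names here are illustrative:

#include <stdint.h>
#include <stdio.h>

#define ALL_VISIBLE 0x01
#define ALL_FROZEN  0x02
#define BITS_PER_HEAPBLOCK 2
#define HEAPBLOCKS_PER_BYTE (8 / BITS_PER_HEAPBLOCK)

/* One pass over the map bytes, producing both counts. */
static void
count_vm_bits(const uint8_t *map, int nbytes, long *n_visible, long *n_frozen)
{
    *n_visible = 0;
    *n_frozen = 0;
    for (int i = 0; i < nbytes; i++)
    {
        for (int blk = 0; blk < HEAPBLOCKS_PER_BYTE; blk++)
        {
            uint8_t bits = (uint8_t) ((map[i] >> (BITS_PER_HEAPBLOCK * blk)) & 0x03);

            if (bits & ALL_VISIBLE)
                (*n_visible)++;
            if (bits & ALL_FROZEN)
                (*n_frozen)++;
        }
    }
}

int
main(void)
{
    /* page 0 visible; page 2 visible+frozen; page 5 visible+frozen */
    uint8_t map[2] = {0x31, 0x0C};
    long nvis, nfro;

    count_vm_bits(map, 2, &nvis, &nfro);
    printf("all-visible=%ld all-frozen=%ld\n", nvis, nfro);   /* prints 3 and 2 */
    return 0;
}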
5.
Clearing visibility and frozen bit separately for the dml
operations would lead locking/unlocking the corresponding buffer
twice, can we do it as a one operation. I think this is suggested
by Simon as well.
6.
- * Before locking the buffer, pin the visibility map page if it appears to
- * be necessary. Since we haven't got the lock yet, someone else might be
+ * Before locking the buffer, pin the visibility map if it appears to be
+ * necessary. Since we haven't got the lock yet, someone else might be
Why you have deleted 'page' in above comment?
7.
@@ -3490,21 +3532,23 @@ l2:
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
+
bms_free(hot_attrs);
Seems unnecessary change.
8.
@@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ {
+ relallvisible = visibilitymap_count(rel, true);
+ relallfrozen = visibilitymap_count(rel, false);
+ }
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
I think in this function, you have forgotten to update the
relallfrozen value in pg_class.
9.
vacuumlazy.c
@@ -253,14 +258,16 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_pages)
+ < vacrelstats->rel_pages)
{
- Assert(!scan_all);
Why you have removed this Assert, won't the count of
vacrelstats->scanned_pages + vacrelstats->vmskipped_pages be
equal to vacrelstats->rel_pages when scan_all = true.
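For reference, a self-contained restatement of the invariant in question, assuming that with scan_all every page is either scanned or skipped via its all-frozen bit; the helper below is hypothetical, not patch code:

#include <assert.h>
#include <stdbool.h>

typedef unsigned int BlockNumber;

/*
 * When scan_all is true, every page should be either scanned or skipped
 * because its all-frozen bit was set, so the two counters are expected to
 * cover rel_pages; this is the invariant the removed Assert guarded.
 */
static bool
check_scanned_all(BlockNumber scanned_pages, BlockNumber vmskipped_pages,
                  BlockNumber rel_pages, bool scan_all)
{
    bool scanned_all = scanned_pages + vmskipped_pages >= rel_pages;

    if (scan_all)
        assert(scanned_all);
    return scanned_all;
}

int
main(void)
{
    return check_scanned_all(90, 10, 100, true) ? 0 : 1;
}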
10.
vacuumlazy.c
lazy_vacuum_rel()
..
+ scanned_all |= scan_all;
+
Why this new assignment is added, please add a comment to
explain it.
11.
lazy_scan_heap()
..
+ * Also, skipping even a single page accorind to all-visible bit of
+ * visibility map means that we can't update relfrozenxid, so we only want
+ * to do it if we can skip a goodly number. On the other hand, we count
+ * both how many pages we skipped according to all-frozen bit of visibility
+ * map and how many pages we freeze page, so we can update relfrozenxid if
+ * the sum of their is as many as tuples per page.
a.
typo
/accorind/according
b.
is the second part of comment (starting from On the other hand)
right? I mean you are comparing sum of pages skipped due to
all_frozen bit and number of pages freezed with tuples per page.
I don't understand how are they related?
12.
@@ -918,8 +954,13 @@ lazy_scan_heap(Relation onerel, LVRelStats
*vacrelstats,
else
{
num_tuples += 1;
+ ntup_in_blk += 1;
hastup = true;
+ /* If current tuple is already frozen, count it up */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ already_nfrozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
Here, if tuple is already_frozen, can't we just continue and
check for next tuple?
13.
+extern XLogRecPtr log_heap_frozenmap(RelFileNode rnode, Buffer heap_buffer,
+ Buffer fm_buffer);
It seems like this function is not used.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 6 July 2015 at 17:28, Simon Riggs <simon@2ndquadrant.com> wrote:
I think we need something for pg_upgrade to rewrite existing VMs. Otherwise
a large read only database would suddenly require a massive revacuum after
upgrade, which seems bad. That can wait for now until we all agree this
patch is sound.
Since we need to rewrite the "vm" map, I think we should call the new map
"vfm"
That way we will be able to easily check whether the rewrite has been
conducted on all relations.
Since the maps are just bits there is no other way to tell that a map has
been rewritten
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
So I don't understand why you have two separate calls to visibilitymap_clear()
Surely the logic should be to clear both bits at the same time?
Yes, you're right. The all-frozen bit should be cleared at the same time
as the all-visible bit.
1. Both bits unset ~(VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN)
which can be changed to state 2 only
2. VISIBILITYMAP_ALL_VISIBLE only
which can be changed state 1 or state 3
3. VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN
which can be changed to state 1 only
If that is the case please simplify the logic for setting and unsetting the bits so they are set together efficiently.
At the same time please also put in Asserts to ensure that the state logic is maintained when it is set and when it is tested.
In patch, during Vacuum first the frozen bit is set and then the visibility
will be set in a later operation, now if the crash happens between those
2 operations, then isn't it possible that the frozen bit is set and visible
bit is not set?
In the current patch, the frozen bit is set first in lazy_scan_heap(), so it's
possible to have the frozen bit set but not the visible bit, as Amit
pointed out.
To fix it, I'm simplifying the patch so that both bits are set
at the same time efficiently.
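A toy illustration (not PostgreSQL code) of why setting both bits in a single operation closes the crash window discussed above, using the patch's flag values:

#include <assert.h>
#include <stdint.h>

#define ALL_VISIBLE 0x01
#define ALL_FROZEN  0x02

/* Two-step update: a crash after the first store can leave the invalid
 * frozen-but-not-visible state behind. */
static void
set_in_two_steps(uint8_t *bits)
{
    *bits |= ALL_FROZEN;        /* crash here => 0x02, an invalid state */
    *bits |= ALL_VISIBLE;
}

/* Single-step update: no intermediate state can ever be observed. */
static void
set_in_one_step(uint8_t *bits)
{
    *bits |= (ALL_VISIBLE | ALL_FROZEN);
}

int
main(void)
{
    uint8_t a = 0, b = 0;

    set_in_two_steps(&a);
    set_in_one_step(&b);
    assert(a == b);             /* same final state, different crash exposure */
    return 0;
}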
I would also like to see the visibilitymap_test function exposed in SQL,
so we can write code to examine the map contents for particular ctids.
By doing that we can then write a formal test that shows the evolution of tuples from insertion,
vacuuming and freezing, testing the map has been set correctly at each stage.
I guess that needs to be done as an isolation test so we have an observer that constrains the xmin in various ways.
In light of multixact bugs, any code that changes the on-disk tuple metadata needs formal tests.
Attached patch adds a few function to contrib/pg_freespacemap to
explore the inside of visibility map, which I used for my test.
I hope it helps for testing this feature.
I think we need something for pg_upgrade to rewrite existing VMs.
Otherwise a large read only database would suddenly require a massive
revacuum after upgrade, which seems bad. That can wait for now until we all
agree this patch is sound.
Yeah, I will address them.
Regards,
--
Sawada Masahiko
Attachments:
001_visibilitymap_test_function.patchapplication/octet-stream; name=001_visibilitymap_test_function.patchDownload
diff --git a/contrib/pg_freespacemap/pg_freespacemap--1.0.sql b/contrib/pg_freespacemap/pg_freespacemap--1.0.sql
index 2adb52a..eb3e752 100644
--- a/contrib/pg_freespacemap/pg_freespacemap--1.0.sql
+++ b/contrib/pg_freespacemap/pg_freespacemap--1.0.sql
@@ -9,6 +9,16 @@ RETURNS int2
AS 'MODULE_PATHNAME', 'pg_freespace'
LANGUAGE C STRICT;
+CREATE FUNCTION pg_is_all_visible(regclass, bigint)
+RETURNS bool
+AS 'MODULE_PATHNAME', 'pg_is_all_visible'
+LANGUAGE C STRICT;
+
+CREATE FUNCTION pg_is_all_frozen(regclass, bigint)
+RETURNS bool
+AS 'MODULE_PATHNAME', 'pg_is_all_frozen'
+LANGUAGE C STRICT;
+
-- pg_freespace shows the recorded space avail at each block in a relation
CREATE FUNCTION
pg_freespace(rel regclass, blkno OUT bigint, avail OUT int2)
@@ -19,7 +29,18 @@ AS $$
$$
LANGUAGE SQL;
+CREATE FUNCTION
+ pg_visibilitymap(rel regclass, blkno OUT bigint, all_visible OUT bool, all_frozen OUT bool)
+RETURNS SETOF RECORD
+AS $$
+ SELECT blkno, pg_is_all_visible($1, blkno) AS all_visible, pg_is_all_frozen($1, blkno) AS all_frozen
+ FROM generate_series(0, pg_relation_size($1) / current_setting('block_size')::bigint - 1) AS blkno;
+$$
+LANGUAGE SQL;
-- Don't want these to be available to public.
REVOKE ALL ON FUNCTION pg_freespace(regclass, bigint) FROM PUBLIC;
REVOKE ALL ON FUNCTION pg_freespace(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_is_all_visible(regclass, bigint) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_is_all_frozen(regclass, bigint) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibilitymap(rel regclass) FROM PUBLIC;
diff --git a/contrib/pg_freespacemap/pg_freespacemap--unpackaged--1.0.sql b/contrib/pg_freespacemap/pg_freespacemap--unpackaged--1.0.sql
index 8651373..ebd5f5f 100644
--- a/contrib/pg_freespacemap/pg_freespacemap--unpackaged--1.0.sql
+++ b/contrib/pg_freespacemap/pg_freespacemap--unpackaged--1.0.sql
@@ -5,3 +5,6 @@
ALTER EXTENSION pg_freespacemap ADD function pg_freespace(regclass,bigint);
ALTER EXTENSION pg_freespacemap ADD function pg_freespace(regclass);
+ALTER EXTENSION pg_freespacemap ADD function pg_is_all_visible(regclass, bigint);
+ALTER EXTENSION pg_freespacemap ADD function pg_is_all_frozen(regclass, bigint);
+ALTER EXTENSION pg_freespacemap ADD function pg_visibilitymap(regclass);
diff --git a/contrib/pg_freespacemap/pg_freespacemap.c b/contrib/pg_freespacemap/pg_freespacemap.c
index 7d939a7..719c879 100644
--- a/contrib/pg_freespacemap/pg_freespacemap.c
+++ b/contrib/pg_freespacemap/pg_freespacemap.c
@@ -8,8 +8,10 @@
*/
#include "postgres.h"
+#include "access/visibilitymap.h"
#include "funcapi.h"
#include "storage/freespace.h"
+#include "storage/bufmgr.h"
PG_MODULE_MAGIC;
@@ -18,6 +20,10 @@ PG_MODULE_MAGIC;
* free space map.
*/
PG_FUNCTION_INFO_V1(pg_freespace);
+PG_FUNCTION_INFO_V1(pg_is_all_visible);
+PG_FUNCTION_INFO_V1(pg_is_all_frozen);
+
+static bool visibilitymap_test_internal(Oid relid, uint64 blkno, uint8);
Datum
pg_freespace(PG_FUNCTION_ARGS)
@@ -39,3 +45,56 @@ pg_freespace(PG_FUNCTION_ARGS)
relation_close(rel, AccessShareLock);
PG_RETURN_INT16(freespace);
}
+
+/*
+ * Return the page is all-visible or not, according to the visibility map.
+ */
+Datum
+pg_is_all_visible(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 blkno = PG_GETARG_INT64(1);
+ bool all_visible;
+
+ all_visible = visibilitymap_test_internal(relid, blkno, VISIBILITYMAP_ALL_VISIBLE);
+
+ PG_RETURN_BOOL(all_visible);
+}
+
+/*
+ * Return the page is all-frozen or not, according to the visibility map.
+ */
+Datum
+pg_is_all_frozen(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 blkno = PG_GETARG_INT64(1);
+ bool all_frozen;
+
+ all_frozen = visibilitymap_test_internal(relid, blkno, VISIBILITYMAP_ALL_FROZEN);
+
+ PG_RETURN_BOOL(all_frozen);
+}
+
+static bool
+visibilitymap_test_internal(Oid relid, uint64 blkno, uint8 flag)
+{
+
+ Relation rel;
+ Buffer vmbuffer = InvalidBuffer;
+ bool result;
+
+ rel = relation_open(relid, AccessShareLock);
+
+ if (blkno < 0 || blkno > MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid block number")));
+
+ result = visibilitymap_test(rel, blkno, &vmbuffer, flag);
+
+ ReleaseBuffer(vmbuffer);
+ relation_close(rel, AccessShareLock);
+
+ return result;
+}
On 7 July 2015 at 15:18, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
I would also like to see the visibilitymap_test function exposed in SQL,
so we can write code to examine the map contents for particular ctids.
By doing that we can then write a formal test that shows the evolutionof tuples from insertion,
vacuuming and freezing, testing the map has been set correctly at each
stage.
I guess that needs to be done as an isolation test so we have an observer
that constrains the xmin in various ways.
In light of multixact bugs, any code that changes the on-disk tuple
metadata needs formal tests.
Attached patch adds a few function to contrib/pg_freespacemap to
explore the inside of visibility map, which I used for my test.
I hope it helps for testing this feature.
I don't think pg_freespacemap is the right place.
I'd prefer to add that as a single function into core, so we can write
formal tests. I would not personally commit this feature without rigorous
and easily repeatable verification.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-07-07 16:25:13 +0100, Simon Riggs wrote:
I don't think pg_freespacemap is the right place.
I agree that pg_freespacemap sounds like an odd location.
I'd prefer to add that as a single function into core, so we can write
formal tests.
With the advent of src/test/modules it's not really a prerequisite for
things to be builtin to be testable. I think there's fair arguments for
moving stuff like pg_stattuple, pg_freespacemap, pg_buffercache into
core at some point, but that's probably a separate discussion.
Regards,
Andres
On Wed, Jul 8, 2015 at 12:37 AM, Andres Freund <andres@anarazel.de> wrote:
On 2015-07-07 16:25:13 +0100, Simon Riggs wrote:
I don't think pg_freespacemap is the right place.
I agree that pg_freespacemap sounds like an odd location.
I'd prefer to add that as a single function into core, so we can write
formal tests.
With the advent of src/test/modules it's not really a prerequisite for
things to be builtin to be testable. I think there's fair arguments for
moving stuff like pg_stattuple, pg_freespacemap, pg_buffercache into
core at some point, but that's probably a separate discussion.
I understood.
So I will place bunch of test like src/test/module/visibilitymap_test,
which contains some tests regarding this feature,
and gather them into one patch.
Regards,
--
Sawada Masahiko
On Tue, Jul 7, 2015 at 5:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jul 2, 2015 at 9:00 PM, Sawada Masahiko <sawada.mshk@gmail.com>
wrote:
Thank you for bug report, and comments.
Fixed version is attached, and source code comment is also updated.
Please review it.
I am looking into this patch and would like to share my findings with
you:
Few more comments:
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83)
BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ up-to-date) */
You have added relallfrozen similar to relallvisible, but how you
are planning to use it, is there any usecase for it?
lazy_scan_heap()
..
- /* Current block is all-visible */
+ /*
+ * Current block is all-visible.
+ * If visibility map represents that it's all frozen, we can
+ * skip to vacuum page unconditionally.
+ */
+ if (visibilitymap_test(onerel, blkno, &vmbuffer,
VISIBILITYMAP_ALL_FROZEN))
+ {
+ vacrelstats->vmskipped_pages++;
+ continue;
+ }
+
a. please explain in comment why it is safe if someone clear the
frozen bit concurrently
b. won't skipping pages intermittently due to set frozen bit break the
readahead mechanism? In this regard, if possible, I think we should
do some tests to see the benefit of this patch. I understand that in
general, it will be good to skip pages, however it seems better to check
that with some different kind of tests.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 7 July 2015 at 18:45, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Wed, Jul 8, 2015 at 12:37 AM, Andres Freund <andres@anarazel.de> wrote:
On 2015-07-07 16:25:13 +0100, Simon Riggs wrote:
I don't think pg_freespacemap is the right place.
I agree that pg_freespacemap sounds like an odd location.
I'd prefer to add that as a single function into core, so we can write
formal tests.
With the advent of src/test/modules it's not really a prerequisite for
things to be builtin to be testable. I think there's fair arguments for
moving stuff like pg_stattuple, pg_freespacemap, pg_buffercache into
core at some point, but that's probably a separate discussion.
I understood.
So I will place bunch of test like src/test/module/visibilitymap_test,
which contains some tests regarding this feature,
and gather them into one patch.
Please place it in core. I see value in having a diagnostic function for
general use on production systems.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jul 7, 2015 at 5:37 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 6 July 2015 at 17:28, Simon Riggs <simon@2ndquadrant.com> wrote:
I think we need something for pg_upgrade to rewrite existing VMs.
Otherwise a large read only database would suddenly require a massive
revacuum after upgrade, which seems bad. That can wait for now until we all
agree this patch is sound.
Since we need to rewrite the "vm" map, I think we should call the new map
"vfm"
+1 for changing the name, as the map now contains more than visibility
information.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Jul 3, 2015 at 1:25 AM, Sawada Masahiko <sawada.mshk@gmail.com>
wrote:
On Fri, Jul 3, 2015 at 1:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
Also, the flags of each heap page header might be set PD_ALL_FROZEN,
as well as all-visible
Is it possible to have VM bits set to frozen but not visible?
The description makes those two states sound independent of each other.
Are they? Or not? Do we test for an impossible state?
It's impossible to have VM bits set to frozen but not visible.
These bits are controlled independently. But eventually, when the
all-frozen bit is set, all-visible is also set.
If that combination is currently impossible, could it be used to indicate that
the page is all empty?
Having a crash-proof bitmap of all-empty pages would make vacuum truncation
scans much more efficient.
Cheers,
Jeff
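Purely as a sketch of Jeff's suggestion (not part of the patch), the otherwise-invalid frozen-without-visible combination could be read as an all-empty marker:

#include <stdbool.h>
#include <stdint.h>

#define ALL_VISIBLE 0x01
#define ALL_FROZEN  0x02

/*
 * Speculative encoding: since "frozen but not visible" (0x02 alone) never
 * occurs otherwise, that bit pattern could be repurposed to mean "page is
 * empty".  Hypothetical helper, not patch code.
 */
bool
bits_mean_all_empty(uint8_t bits)
{
    return (bits & ALL_FROZEN) != 0 && (bits & ALL_VISIBLE) == 0;
}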
On 7/8/15 8:31 AM, Simon Riggs wrote:
I understood.
So I will place bunch of test like src/test/module/visibilitymap_test,
which contains some tests regarding this feature,
and gather them into one patch.
Please place it in core. I see value in having a diagnostic function for
general use on production systems.
+1. I don't think there's value to keeping this stuff away from DBAs.
Perhaps it should default to only SU being able to execute it though.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Thu, Jul 9, 2015 at 4:31 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Fri, Jul 3, 2015 at 1:25 AM, Sawada Masahiko <sawada.mshk@gmail.com>
wrote:
On Fri, Jul 3, 2015 at 1:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 2 July 2015 at 16:30, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
Also, the flags of each heap page header might be set PD_ALL_FROZEN,
as well as all-visible
Is it possible to have VM bits set to frozen but not visible?
The description makes those two states sound independent of each other.
Are they? Or not? Do we test for an impossible state?
It's impossible to have VM bits set to frozen but not visible.
These bits are controlled independently. But eventually, when the
all-frozen bit is set, all-visible is also set.
If that combination is currently impossible, could it be used to indicate that
the page is all empty?
Yeah, the state where the VM bits are set to frozen but not visible is
impossible, so we could use that state to represent some other status
of the page.
Having a crash-proof bitmap of all-empty pages would make vacuum truncation
scans much more efficient.
An empty page is always marked all-visible by vacuum today; isn't that enough?
Regards,
--
Sawada Masahiko
On Tue, Jul 7, 2015 at 8:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jul 2, 2015 at 9:00 PM, Sawada Masahiko <sawada.mshk@gmail.com>
wrote:
Thank you for bug report, and comments.
Fixed version is attached, and source code comment is also updated.
Please review it.
I am looking into this patch and would like to share my findings with
you:
Thank you for comment.
I appreciate you taking time to review this patch.
1.
@@ -2131,8 +2133,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * of all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
*/
typo in comments.
/of all frozen/or all frozen
Fixed.
2.
visibilitymap.c
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen
+ * per heap page.
/and all-frozen/and all-frozen)
closing round bracket is missing.
Fixed.
3.
visibilitymap.c
-/*#define TRACE_VISIBILITYMAP */
+#define TRACE_VISIBILITYMAP
why is this hash define opened?
Fixed.
4.
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, bool for_visible)
This API needs to count set bits for either visibility info, frozen info
or both (if required), it seems better to have second parameter as
uint8 flags rather than bool. Also, if it is required to be called at most
places for both visibility and frozen bits count, why not get them
in one call?
Fixed.
5.
Clearing visibility and frozen bit separately for the dml
operations would lead locking/unlocking the corresponding buffer
twice, can we do it as a one operation. I think this is suggested
by Simon as well.
The latest patch clears both bits in one operation, and sets all-frozen
together with all-visible in one operation.
We can judge whether a page is all-frozen in two places: when first scanning
the page (lazy_scan_heap), and after cleaning up garbage (lazy_vacuum_page).
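A toy sketch of that all-frozen judgement, reusing the patch's counter names; the helper itself is hypothetical:

#include <stdbool.h>

/*
 * A page can be marked all-frozen once every tuple on it is either newly
 * frozen by this vacuum (nfrozen) or was already frozen before
 * (already_nfrozen).  Illustration only, not patch code.
 */
bool
page_is_all_frozen(int ntup_in_blk, int nfrozen, int already_nfrozen)
{
    return ntup_in_blk > 0 && (nfrozen + already_nfrozen) == ntup_in_blk;
}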
6.
- * Before locking the buffer, pin the visibility map page if it appears to
- * be necessary. Since we haven't got the lock yet, someone else might be
+ * Before locking the buffer, pin the visibility map if it appears to be
+ * necessary. Since we haven't got the lock yet, someone else might be
Why you have deleted 'page' in above comment?
Fixed.
7.
@@ -3490,21 +3532,23 @@ l2:
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
if (vmbuffer != InvalidBuffer)
ReleaseBuffer(vmbuffer);
+
bms_free(hot_attrs);
Seems unnecessary change.
Fixed.
8.
@@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ {
+ relallvisible = visibilitymap_count(rel, true);
+ relallfrozen = visibilitymap_count(rel, false);
+ }
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
I think in this function, you have forgotten to update the
relallfrozen value in pg_class.
Fixed.
9.
vacuumlazy.c
@@ -253,14 +258,16 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_pages)
+ < vacrelstats->rel_pages)
{
- Assert(!scan_all);
Why you have removed this Assert, won't the count of
vacrelstats->scanned_pages + vacrelstats->vmskipped_pages be
equal to vacrelstats->rel_pages when scan_all = true.
Fixed.
10.
vacuumlazy.c
lazy_vacuum_rel()
..
+ scanned_all |= scan_all;
+
Why this new assignment is added, please add a comment to
explain it.
It's not necessary, removed.
11.
lazy_scan_heap()
..
+ * Also, skipping even a single page accorind to all-visible bit of
+ * visibility map means that we can't update relfrozenxid, so we only want
+ * to do it if we can skip a goodly number. On the other hand, we count
+ * both how many pages we skipped according to all-frozen bit of visibility
+ * map and how many pages we freeze page, so we can update relfrozenxid if
+ * the sum of their is as many as tuples per page.
a.
typo
/accorind/according
Fixed.
b.
is the second part of comment (starting from On the other hand)
right? I mean you are comparing sum of pages skipped due to
all_frozen bit and number of pages freezed with tuples per page.
I don't understand how are they related?
It's wrong; the last sentence should have said "so we can update
relfrozenxid if the sum of them is as large as the number of pages in the table."
12.
@@ -918,8 +954,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_in_blk += 1;
hastup = true;
+ /* If current tuple is already frozen, count it up */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ already_nfrozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
Here, if tuple is already_frozen, can't we just continue and
check for next tuple?
I think it's impossible because logic related to old-style VACUUM
FULL still remains in HeapTupleHeaderXminFrozen().
13.
+extern XLogRecPtr log_heap_frozenmap(RelFileNode rnode, Buffer heap_buffer,
+ Buffer fm_buffer);
It seems like this function is not used.
Fixed.
You have added relallfrozen similar to relallvisible, but how you
are planning to use it, is there any usecase for it?
Yep, the value of relallfrozen would be useful when the user wants to
estimate how long vacuuming will take.
If this value is low, it's usually a good idea to run VACUUM
FREEZE manually to prevent an unpredictable anti-wraparound vacuum.
a. please explain in comment why it is safe if someone clear the
frozen bit concurrently
b. won't skipping pages intermittently due to set frozen bit break the
readahead mechanism? In this regard, if possible, I think we should
do some tests to see the benefit of this patch. I understand that in
general, it will be good to skip pages, however it seems better to check
that with some different kind of tests.
In the latest patch, we skip all-visible or all-frozen pages until
we find next_not_all_visible_block,
and then we re-check whether the page is all-frozen so that we can skip
vacuuming it even if scan_all is true.
Also, I added a message about the number of skipped frozen pages to
the verbose log for testing.
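A hypothetical restatement of the skip rule just described (not patch code):

#include <stdbool.h>

/*
 * A page whose all-frozen bit is set can be skipped even when scan_all
 * forces an otherwise full scan, while all-visible pages are only skipped
 * by a normal vacuum.  Parameter names mirror the patch; illustration only.
 */
bool
can_skip_page(bool all_visible_in_vm, bool all_frozen_in_vm,
              bool skipping_all_visible_blocks, bool scan_all)
{
    if (all_frozen_in_vm)
        return true;            /* nothing on the page can need freezing */
    if (all_visible_in_vm && skipping_all_visible_blocks && !scan_all)
        return true;
    return false;
}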
Please place it in core. I see value in having a diagnostic function for
general use on production systems.
I added a new heapfuncs.c file for heap-related functions intended for DBA
use, and added these functions to that file.
But the test cases are not done yet; I'm writing them.
The pg_upgrade support is not done yet either.
TODO
- Test case for this feature
- pg_upgrade support.
Regards,
--
Sawada Masahiko
Attachments:
000_add_frozen_bit_into_visibilitymap_v6.patchapplication/octet-stream; name=000_add_frozen_bit_into_visibilitymap_v6.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/src/backend/access/heap/Makefile b/src/backend/access/heap/Makefile
index b83d496..806ce27 100644
--- a/src/backend/access/heap/Makefile
+++ b/src/backend/access/heap/Makefile
@@ -12,6 +12,7 @@ subdir = src/backend/access/heap
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o
+OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o \
+ heapfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 86a2e6b..ac74100 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -88,7 +88,7 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared);
static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
Bitmapset *hot_attrs,
Bitmapset *key_attrs, Bitmapset *id_attrs,
@@ -2131,8 +2131,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -2147,10 +2148,13 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllFrozen(BufferGetPage(buffer));
+
visibilitymap_clear(relation,
- ItemPointerGetBlockNumber(&(heaptup->t_self)),
- vmbuffer);
+ ItemPointerGetBlockNumber(&(heaptup->t_self)), vmbuffer);
}
/*
@@ -2448,10 +2452,12 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
- visibilitymap_clear(relation,
- BufferGetBlockNumber(buffer),
- vmbuffer);
+ PageClearAllFrozen(page);
+
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer);
}
/*
@@ -2495,7 +2501,9 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
/* the rest of the scratch space is used for tuple data */
tupledata = scratchptr;
- xlrec->flags = all_visible_cleared ? XLH_INSERT_ALL_VISIBLE_CLEARED : 0;
+ if (all_visible_cleared)
+ xlrec->flags = XLH_INSERT_ALL_VISIBLE_CLEARED;
+
xlrec->ntuples = nthispage;
/*
@@ -2731,9 +2739,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -2925,12 +2933,16 @@ l1:
*/
PageSetPrunable(page, xid);
+ /* clear PD_ALL_VISIBLE and PD_ALL_FROZEN flags */
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
- visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ PageClearAllFrozen(page);
+
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer);
}
/* store transaction information of xact deleting the tuple */
@@ -2961,7 +2973,9 @@ l1:
if (RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
- xlrec.flags = all_visible_cleared ? XLH_DELETE_ALL_VISIBLE_CLEARED : 0;
+ if (all_visible_cleared)
+ xlrec.flags = XLH_DELETE_ALL_VISIBLE_CLEARED;
+
xlrec.infobits_set = compute_infobits(tp.t_data->t_infomask,
tp.t_data->t_infomask2);
xlrec.offnum = ItemPointerGetOffsetNumber(&tp.t_self);
@@ -3207,7 +3221,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -3801,16 +3815,22 @@ l2:
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(buffer));
- visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ PageClearAllFrozen(BufferGetPage(buffer));
+
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer);
}
if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
{
all_visible_cleared_new = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(newbuf));
- visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
- vmbuffer_new);
+ PageClearAllFrozen(BufferGetPage(newbuf));
+
+ visibilitymap_clear(relation, BufferGetBlockNumber(newbuf), vmbuffer_new);
}
if (newbuf != buffer)
@@ -6893,7 +6913,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -6903,6 +6923,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7492,8 +7513,14 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
+
}
else if (action == BLK_RESTORED)
{
@@ -7544,7 +7571,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
@@ -7694,7 +7721,10 @@ heap_xlog_delete(XLogReaderState *record)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
/* Make sure there is no forward chain link in t_ctid */
htup->t_ctid = target_tid;
@@ -7798,7 +7828,10 @@ heap_xlog_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
MarkBufferDirty(buffer);
}
@@ -7937,7 +7970,10 @@ heap_xlog_multi_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
MarkBufferDirty(buffer);
}
@@ -8065,7 +8101,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(obuffer);
@@ -8200,7 +8239,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
elog(PANIC, "heap_update_redo: failed to add tuple");
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/heap/heapfuncs.c b/src/backend/access/heap/heapfuncs.c
new file mode 100644
index 0000000..a0cc165
--- /dev/null
+++ b/src/backend/access/heap/heapfuncs.c
@@ -0,0 +1,80 @@
+/*-------------------------------------------------------------------------
+ *
+ * heapfuncs.c
+ * Functions for accessing the related heap page
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/heap/heapfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/visibilitymap.h"
+#include "funcapi.h"
+#include "storage/freespace.h"
+#include "storage/bufmgr.h"
+
+/* Functions for visibilitymap */
+extern Datum pg_is_all_visible(PG_FUNCTION_ARGS);
+extern Datum pg_is_all_frozen(PG_FUNCTION_ARGS);
+
+static bool visibilitymap_test_internal(Oid relid, uint64 blkno, uint8);
+
+/*
+ * Return the page is all-visible or not, according to the visibility map.
+ */
+Datum
+pg_is_all_visible(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 blkno = PG_GETARG_INT64(1);
+ bool all_visible;
+
+ all_visible = visibilitymap_test_internal(relid, blkno, VISIBILITYMAP_ALL_VISIBLE);
+
+ PG_RETURN_BOOL(all_visible);
+}
+
+/*
+ * Return the page is all-frozen or not, according to the visibility map.
+ */
+Datum
+pg_is_all_frozen(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 blkno = PG_GETARG_INT64(1);
+ bool all_frozen;
+
+ all_frozen = visibilitymap_test_internal(relid, blkno, VISIBILITYMAP_ALL_FROZEN);
+
+ PG_RETURN_BOOL(all_frozen);
+}
+
+static bool
+visibilitymap_test_internal(Oid relid, uint64 blkno, uint8 flag)
+{
+
+ Relation rel;
+ Buffer vmbuffer = InvalidBuffer;
+ bool result;
+
+ rel = relation_open(relid, AccessShareLock);
+
+ if (blkno < 0 || blkno > MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid block number")));
+
+ result = visibilitymap_test(rel, blkno, &vmbuffer, flag);
+
+ ReleaseBuffer(vmbuffer);
+ relation_close(rel, AccessShareLock);
+
+ return result;
+}
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..13ad5b1 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,33 +21,44 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if a whole-table-scanning vacuum is required. The all-frozen bit must be set
+ * only when the page is already all-visible; that is, the all-frozen bit is
+ * always set together with the all-visible bit.
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. And all-frozen bit must be
+ * cleared with all-visible at the same time.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or frozen
+ * to all transactions; we just don't know that for certain. The difficulty is
+ * that there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit. If a
+ * crash occurs after the visibility map page makes it to disk and before the
+ * updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
+ * The visibility map is not used for anti-wraparound vacuums before 9.5, because
* an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
* present in the table, even on pages that don't have any dead tuples.
+ * In 9.6 or later, the visibility map has an additional bit which indicates that
+ * all tuples on a single page have been completely frozen, so the visibility map
+ * is also used for anti-wraparound vacuums.
+ *
*
* LOCKING
*
@@ -58,14 +69,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +112,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +129,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set bits for visible and freeze */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +173,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,7 +185,8 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) <<
+ (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
@@ -225,7 +258,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +267,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and flags which indicates what flag we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +279,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -272,11 +307,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +320,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +330,15 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(PageIsAllVisible(heapPage) || PageIsAllFrozen(heapPage));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +349,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test if bit(s) is set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must set flags
+ * which indicates what flag we want to test.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -328,7 +368,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +377,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +400,12 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single or double bit read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ?
+ true : false;
return result;
}
@@ -374,10 +415,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * The caller must set the flags which indicates what flag we want to count.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, uint8 flags)
{
BlockNumber result = 0;
BlockNumber mapBlock;
@@ -406,7 +448,10 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ if (flags & VISIBILITYMAP_ALL_VISIBLE)
+ result += number_of_ones_for_visible[map[i]];
+ if (flags & VISIBILITYMAP_ALL_FROZEN)
+ result += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 4246554..015bfb8 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ {
+ relallvisible = visibilitymap_count(rel, VISIBILITYMAP_ALL_VISIBLE);
+ relallfrozen = visibilitymap_count(rel, VISIBILITYMAP_ALL_FROZEN);
+ }
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
if (rd_rel->relpages != (int32) relpages)
{
@@ -1940,6 +1947,11 @@ index_update_stats(Relation rel,
rd_rel->relallvisible = (int32) relallvisible;
dirty = true;
}
+ if (rd_rel->relallfrozen != (int32) relallfrozen)
+ {
+ rd_rel->relallfrozen = (int32) relallfrozen;
+ dirty = true;
+ }
}
/*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 861048f..392c2a4 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -572,7 +572,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN),
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -595,6 +596,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ab4874..d3725dd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -22,6 +22,7 @@
#include "access/rewriteheap.h"
#include "access/transam.h"
#include "access/tuptoaster.h"
+#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index baf66f1..d68c7c4 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -744,6 +744,7 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
bool in_outer_xact)
@@ -781,6 +782,11 @@ vac_update_relstats(Relation relation,
pgcform->relallvisible = (int32) num_all_visible_pages;
dirty = true;
}
+ if (pgcform->relallfrozen != (int32) num_all_frozen_pages)
+ {
+ pgcform->relallfrozen = (int32) num_all_frozen_pages;
+ dirty = true;
+ }
/* Apply DDL updates, but not inside an outer transaction (see above) */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..12322a4 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped by all-frozen bit
+ of visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we may still skip scanning some pages
+ * according to the all-frozen bit of the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,13 +258,16 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
}
else
+ {
scanned_all = true;
+ }
/*
* Optionally truncate the relation.
@@ -301,10 +309,14 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ new_rel_allfrozen = visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN);
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -312,6 +324,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_pages,
new_rel_tuples,
new_rel_allvisible,
+ new_rel_allfrozen,
vacrelstats->hasindex,
new_frozen_xid,
new_min_multi,
@@ -360,10 +373,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen page according to visibility map\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +500,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of the
+ * visibility map means that we can't update relfrozenxid, so we only want
+ * to do it if we can skip a goodly number. On the other hand, we count
+ * both how many pages we skipped according to the all-frozen bit of the
+ * visibility map and how many pages we froze, so we can update relfrozenxid
+ * if the sum of them is as many as the pages of the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -515,7 +532,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -533,7 +551,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples frozen by this vacuum */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page; /* # of tuples on this page */
Size freespace;
bool all_visible_according_to_vm;
bool all_visible;
@@ -548,7 +569,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -566,9 +588,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We check whether it is also all-frozen, so that we can skip vacuuming
+ * this page even when a whole-table scan is required.
+ */
+ if (scan_all)
+ {
+ if (visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ }
+ else
+ {
+ if (skipping_all_visible_blocks)
+ continue;
+ }
+
all_visible_according_to_vm = true;
}
@@ -740,7 +778,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +803,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +959,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +977,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
+ * We must log the changes to be crash-safe against future truncation
+ * of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1013,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1038,47 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen in total */
+ if ((ntotal_frozen == ntup_per_page) &&
+ !visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,7 +1089,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
@@ -1047,6 +1118,17 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
visibilitymap_clear(onerel, blkno, vmbuffer);
}
+ /*
+ * As a result of scanning a page, we set VM all-frozen bit and page header
+ * if all tuples of single page are frozen.
+ */
+ if (ntotal_frozen == ntup_per_page)
+ {
+ PageSetAllFrozen(page);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer,
+ InvalidTransactionId, VISIBILITYMAP_ALL_FROZEN);
+ }
+
UnlockReleaseBuffer(buf);
/* Remember the location of the last page with nonremovable tuples */
@@ -1078,7 +1160,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map and frozen map page.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1126,6 +1208,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
nunused);
appendStringInfo(&buf, _("Skipped %u pages due to buffer pins.\n"),
vacrelstats->pinskipped_pages);
+ appendStringInfo(&buf, _("Skipped %u frozen pages according to visibility map.\n"),
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf, _("%u pages are entirely empty.\n"),
empty_pages);
appendStringInfo(&buf, _("%s."),
@@ -1226,6 +1310,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1277,19 +1362,31 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set VM all-frozen bit and flag.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
- Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ if (!visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* mark page all-frozen, and set VM all-frozen bit */
+ if (all_frozen)
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1408,6 +1505,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_pages,
stats->num_index_tuples,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
@@ -1782,7 +1880,8 @@ vac_cmp_itemptr(const void *left, const void *right)
* xmin amongst the visible tuples.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1791,6 +1890,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1814,7 +1914,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
@@ -1855,6 +1955,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1863,6 +1967,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 874ca6a..376841a 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -127,7 +127,7 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList)
if (attno != resultDesc->natts)
ereport(ERROR,
(errcode(ERRCODE_DATATYPE_MISMATCH),
- errmsg("table row type and query-specified row type do not match"),
+ errmsg("table row type and query-specified row type do not match"),
errdetail("Query has too few columns.")));
}
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..93afb10 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..7270609 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,20 @@
#include "storage/buf.h"
#include "utils/relcache.h"
-extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
+extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern BlockNumber visibilitymap_count(Relation rel, uint8 flags);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index e526cd9..ea0f7c1 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ * up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -95,7 +97,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 30
+#define Natts_pg_class 31
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -107,25 +109,26 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relispopulated 25
-#define Anum_pg_class_relreplident 26
-#define Anum_pg_class_relfrozenxid 27
-#define Anum_pg_class_relminmxid 28
-#define Anum_pg_class_relacl 29
-#define Anum_pg_class_reloptions 30
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relispopulated 26
+#define Anum_pg_class_relreplident 27
+#define Anum_pg_class_relfrozenxid 28
+#define Anum_pg_class_relminmxid 29
+#define Anum_pg_class_relacl 30
+#define Anum_pg_class_reloptions 31
/* ----------------
* initial contents of pg_class
@@ -140,13 +143,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 6fd1278..7d9b93f 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3213,6 +3213,12 @@ DESCR("sleep until the specified time");
DATA(insert OID = 2971 ( text PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 25 "16" _null_ _null_ _null_ _null_ _null_ booltext _null_ _null_ _null_ ));
DESCR("convert boolean to text");
+DATA(insert OID = 3298 ( pg_is_all_visible PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 16 "2205 20" _null_ _null_ _null_ _null_ _null_ pg_is_all_visible _null_ _null_ _null_ ));
+DESCR("true if the page is all visible");
+DATA(insert OID = 3299 ( pg_is_all_frozen PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 16 "2205 20" _null_ _null_ _null_ _null_ _null_ pg_is_all_frozen _null_ _null_ _null_ ));
+DESCR("true if the page is all frozen");
+
+
/* Aggregates (moved here from pg_aggregate for 7.3) */
DATA(insert OID = 2100 ( avg PGNSP PGUID 12 1 0 0 0 t f f f f f i 1 0 1700 "20" _null_ _null_ _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e3a31af..d2bae2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -172,6 +172,7 @@ extern void vac_update_relstats(Relation relation,
BlockNumber num_pages,
double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..7bf2718 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ * frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -369,6 +371,13 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+#define PageClearAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags &= ~PD_ALL_FROZEN)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
On Wed, Jul 8, 2015 at 10:10 PM, Sawada Masahiko <sawada.mshk@gmail.com>
wrote:
On Thu, Jul 9, 2015 at 4:31 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Fri, Jul 3, 2015 at 1:25 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
It's impossible to have VM bits set to frozen but not visible.
These bits are controlled independently. But eventually, when the
all-frozen bit is set, the all-visible bit is also set.
If that combination is currently impossible, could it be used to indicate
that the page is all empty?
Yeah, the state where the VM bits are set to frozen but not visible is
impossible, so we could use that state to represent some other state of
the page.
Having a crash-proof bitmap of all-empty pages would make vacuum
truncation scans much more efficient.
An empty page is always marked all-visible by vacuum today; is that not
enough?
The "current" vacuum can just remember that they were empty as well as
all-visible.
But the next vacuum that occurs on the table won't know that they are
empty, just that they are all-visible, so it can't truncate them away
without having to read each one first.
It is a minor thing, but if there is no other use for this fourth
"bit-space", it seems a shame to waste it when there is some use for it. I
haven't looked at the code around this area to know how hard it would be to
implement the setting and clearing of the bit.
Cheers,
Jeff
On Fri, Jul 10, 2015 at 3:05 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
pg_upgrade support is also not done yet.
TODO
- Test case for this feature
- pg_upgrade support.
I had forgotten to change the fork name of visibility map to "vfm".
Attached is the latest v7 patch.
Please review it.
Regards,
--
Sawada Masahiko
Attachments:
000_add_frozen_bit_into_visibilitymap_v7.patchtext/x-diff; charset=US-ASCII; name=000_add_frozen_bit_into_visibilitymap_v7.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/src/backend/access/heap/Makefile b/src/backend/access/heap/Makefile
index b83d496..806ce27 100644
--- a/src/backend/access/heap/Makefile
+++ b/src/backend/access/heap/Makefile
@@ -12,6 +12,7 @@ subdir = src/backend/access/heap
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o
+OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o \
+ heapfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 86a2e6b..ac74100 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -88,7 +88,7 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared);
static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
Bitmapset *hot_attrs,
Bitmapset *key_attrs, Bitmapset *id_attrs,
@@ -2131,8 +2131,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -2147,10 +2148,13 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllFrozen(BufferGetPage(buffer));
+
visibilitymap_clear(relation,
- ItemPointerGetBlockNumber(&(heaptup->t_self)),
- vmbuffer);
+ ItemPointerGetBlockNumber(&(heaptup->t_self)), vmbuffer);
}
/*
@@ -2448,10 +2452,12 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
- visibilitymap_clear(relation,
- BufferGetBlockNumber(buffer),
- vmbuffer);
+ PageClearAllFrozen(page);
+
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer);
}
/*
@@ -2495,7 +2501,9 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
/* the rest of the scratch space is used for tuple data */
tupledata = scratchptr;
- xlrec->flags = all_visible_cleared ? XLH_INSERT_ALL_VISIBLE_CLEARED : 0;
+ if (all_visible_cleared)
+ xlrec->flags = XLH_INSERT_ALL_VISIBLE_CLEARED;
+
xlrec->ntuples = nthispage;
/*
@@ -2731,9 +2739,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -2925,12 +2933,16 @@ l1:
*/
PageSetPrunable(page, xid);
+ /* clear PD_ALL_VISIBLE and PD_ALL_FROZEN flags */
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
- visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ PageClearAllFrozen(page);
+
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer);
}
/* store transaction information of xact deleting the tuple */
@@ -2961,7 +2973,9 @@ l1:
if (RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
- xlrec.flags = all_visible_cleared ? XLH_DELETE_ALL_VISIBLE_CLEARED : 0;
+ if (all_visible_cleared)
+ xlrec.flags = XLH_DELETE_ALL_VISIBLE_CLEARED;
+
xlrec.infobits_set = compute_infobits(tp.t_data->t_infomask,
tp.t_data->t_infomask2);
xlrec.offnum = ItemPointerGetOffsetNumber(&tp.t_self);
@@ -3207,7 +3221,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -3801,16 +3815,22 @@ l2:
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(buffer));
- visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ PageClearAllFrozen(BufferGetPage(buffer));
+
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer);
}
if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
{
all_visible_cleared_new = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(newbuf));
- visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
- vmbuffer_new);
+ PageClearAllFrozen(BufferGetPage(newbuf));
+
+ visibilitymap_clear(relation, BufferGetBlockNumber(newbuf), vmbuffer_new);
}
if (newbuf != buffer)
@@ -6893,7 +6913,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -6903,6 +6923,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7492,8 +7513,14 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
+
}
else if (action == BLK_RESTORED)
{
@@ -7544,7 +7571,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
@@ -7694,7 +7721,10 @@ heap_xlog_delete(XLogReaderState *record)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
/* Make sure there is no forward chain link in t_ctid */
htup->t_ctid = target_tid;
@@ -7798,7 +7828,10 @@ heap_xlog_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
MarkBufferDirty(buffer);
}
@@ -7937,7 +7970,10 @@ heap_xlog_multi_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
MarkBufferDirty(buffer);
}
@@ -8065,7 +8101,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(obuffer);
@@ -8200,7 +8239,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
elog(PANIC, "heap_update_redo: failed to add tuple");
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..13ad5b1 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,33 +21,44 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if a whole-table scanning vacuum is required. An all-frozen bit must be set only
+ * when the page is already all-visible. That is, the all-frozen bit is always set
+ * together with the all-visible bit.
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. And the all-frozen bit must be
+ * cleared whenever the all-visible bit is cleared.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or frozen
+ * to all transactions; we just don't know that for certain. The difficulty is
+ * that there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit. If a
+ * crash occurs after the visibility map page makes it to disk and before the
+ * updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
+ * The visibility map is not used for anti-wraparound vacuums before 9.5, because
* an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
* present in the table, even on pages that don't have any dead tuples.
+ * In 9.6 or later, the visibility map has an additional bit which indicates that all
+ * tuples on a single page have been completely frozen, so the visibility map is also used for
+ * anti-wraparound vacuums.
+ *
*
* LOCKING
*
@@ -58,14 +69,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +112,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+*/
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +129,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set bits for visible and freeze */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
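For reference, these two lookup tables follow mechanically from the two-bit
layout: within each map byte, heap block i owns bit 2*i (the all-visible bit)
and bit 2*i+1 (the all-frozen bit), so the "visible" table counts set bits in
even positions and the "frozen" table counts set bits in odd positions. A
small standalone generator along these lines reproduces both arrays (this is
only a sketch of the layout, not part of the patch):

#include <stdio.h>

int
main(void)
{
    int     b;

    for (b = 0; b < 256; b++)
    {
        int     blk;
        int     nvisible = 0;
        int     nfrozen = 0;

        for (blk = 0; blk < 4; blk++)       /* HEAPBLOCKS_PER_BYTE == 4 */
        {
            if (b & (0x01 << (2 * blk)))    /* VISIBILITYMAP_ALL_VISIBLE */
                nvisible++;
            if (b & (0x02 << (2 * blk)))    /* VISIBILITYMAP_ALL_FROZEN */
                nfrozen++;
        }
        printf("byte 0x%02x: %d visible, %d frozen\n", b, nvisible, nfrozen);
    }
    return 0;
}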
@@ -141,7 +173,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,7 +185,8 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) <<
+ (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
@@ -225,7 +258,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +267,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and the flags indicating which bit(s) to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +279,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -272,11 +307,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +320,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +330,15 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(PageIsAllVisible(heapPage) || PageIsAllFrozen(heapPage));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +349,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test if bit(s) is set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must pass
+ * the flags indicating which bit(s) to test.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -328,7 +368,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +377,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +400,12 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single or double bit read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ?
+ true : false;
return result;
}
@@ -374,10 +415,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * The caller must pass the flags indicating which bit(s) to count.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, uint8 flags)
{
BlockNumber result = 0;
BlockNumber mapBlock;
@@ -406,7 +448,10 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ if (flags & VISIBILITYMAP_ALL_VISIBLE)
+ result += number_of_ones_for_visible[map[i]];
+ if (flags & VISIBILITYMAP_ALL_FROZEN)
+ result += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
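The same two-bit layout drives the addressing used by visibilitymap_clear(),
visibilitymap_set() and visibilitymap_test() above: a heap block number is
first reduced to a map byte, then to a bit pair inside that byte, and the
caller's VISIBILITYMAP_* flags are shifted into that pair. A minimal
standalone sketch of the arithmetic (the helper names are illustrative, not
from the patch):

#include <stdbool.h>
#include <stdint.h>

#define VISIBILITYMAP_ALL_VISIBLE   0x01
#define VISIBILITYMAP_ALL_FROZEN    0x02

#define BITS_PER_HEAPBLOCK  2   /* all-visible bit + all-frozen bit */
#define HEAPBLOCKS_PER_BYTE 4   /* 8 bits / BITS_PER_HEAPBLOCK */

/* Set the requested flag bits for heap block blkno in a raw map array. */
static void
map_set_flags(uint8_t *map, uint32_t blkno, uint8_t flags)
{
    uint32_t    mapByte = blkno / HEAPBLOCKS_PER_BYTE;
    int         mapBit = blkno % HEAPBLOCKS_PER_BYTE;

    map[mapByte] |= (uint8_t) (flags << (BITS_PER_HEAPBLOCK * mapBit));
}

/* Return true only if every requested flag bit is set for blkno. */
static bool
map_test_flags(const uint8_t *map, uint32_t blkno, uint8_t flags)
{
    uint32_t    mapByte = blkno / HEAPBLOCKS_PER_BYTE;
    int         mapBit = blkno % HEAPBLOCKS_PER_BYTE;
    uint8_t     mask = (uint8_t) (flags << (BITS_PER_HEAPBLOCK * mapBit));

    return (map[mapByte] & mask) == mask;
}

Note that visibilitymap_test() in the patch returns true as soon as any of the
requested bits is set; the all-bits-set variant above is just one way to read
the mask.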
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 4246554..015bfb8 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ {
+ relallvisible = visibilitymap_count(rel, VISIBILITYMAP_ALL_VISIBLE);
+ relallfrozen = visibilitymap_count(rel, VISIBILITYMAP_ALL_FROZEN);
+ }
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
if (rd_rel->relpages != (int32) relpages)
{
@@ -1940,6 +1947,11 @@ index_update_stats(Relation rel,
rd_rel->relallvisible = (int32) relallvisible;
dirty = true;
}
+ if (rd_rel->relallfrozen != (int32) relallfrozen)
+ {
+ rd_rel->relallfrozen = (int32) relallfrozen;
+ dirty = true;
+ }
}
/*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 861048f..392c2a4 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -572,7 +572,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN),
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -595,6 +596,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ab4874..d3725dd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -22,6 +22,7 @@
#include "access/rewriteheap.h"
#include "access/transam.h"
#include "access/tuptoaster.h"
+#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index baf66f1..d68c7c4 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -744,6 +744,7 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
bool in_outer_xact)
@@ -781,6 +782,11 @@ vac_update_relstats(Relation relation,
pgcform->relallvisible = (int32) num_all_visible_pages;
dirty = true;
}
+ if (pgcform->relallfrozen != (int32) num_all_frozen_pages)
+ {
+ pgcform->relallfrozen = (int32) num_all_frozen_pages;
+ dirty = true;
+ }
/* Apply DDL updates, but not inside an outer transaction (see above) */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..12322a4 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped by all-frozen bit
+ of visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we may still skip scanning some pages
+ * according to the all-frozen bit of the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,13 +258,16 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
}
else
+ {
scanned_all = true;
+ }
/*
* Optionally truncate the relation.
@@ -301,10 +309,14 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ new_rel_allfrozen = visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN);
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -312,6 +324,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_pages,
new_rel_tuples,
new_rel_allvisible,
+ new_rel_allfrozen,
vacrelstats->hasindex,
new_frozen_xid,
new_min_multi,
@@ -360,10 +373,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen page according to visibility map\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +500,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of the
+ * visibility map means that we can't update relfrozenxid, so we only want
+ * to do it if we can skip a goodly number. On the other hand, we count
+ * both how many pages we skipped according to the all-frozen bit of the
+ * visibility map and how many pages we froze, so we can update relfrozenxid
+ * if their sum covers all pages of the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -515,7 +532,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -533,7 +551,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples frozen by this vacuum */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
bool all_visible;
@@ -548,7 +569,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -566,9 +588,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We check whether this block is also all-frozen, so that we can skip
+ * vacuuming it even when a whole-table scan is required.
+ */
+ if (scan_all)
+ {
+ if (visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ }
+ else
+ {
+ if (skipping_all_visible_blocks)
+ continue;
+ }
+
all_visible_according_to_vm = true;
}
@@ -740,7 +778,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +803,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +959,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +977,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
+ * We must log the changes to be crash-safe against future truncation
+ * of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1013,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1038,47 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen in total */
+ if ((ntotal_frozen == ntup_per_page) &&
+ !visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,7 +1089,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
@@ -1047,6 +1118,17 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
visibilitymap_clear(onerel, blkno, vmbuffer);
}
+ /*
+ * As a result of scanning the page, set the VM all-frozen bit and the page
+ * header flag if all tuples on this page are frozen.
+ */
+ if (ntotal_frozen == ntup_per_page)
+ {
+ PageSetAllFrozen(page);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer,
+ InvalidTransactionId, VISIBILITYMAP_ALL_FROZEN);
+ }
+
UnlockReleaseBuffer(buf);
/* Remember the location of the last page with nonremovable tuples */
@@ -1078,7 +1160,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map and frozen map page.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1126,6 +1208,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
nunused);
appendStringInfo(&buf, _("Skipped %u pages due to buffer pins.\n"),
vacrelstats->pinskipped_pages);
+ appendStringInfo(&buf, _("Skipped %u frozen pages according to visibility map.\n"),
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf, _("%u pages are entirely empty.\n"),
empty_pages);
appendStringInfo(&buf, _("%s."),
@@ -1226,6 +1310,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1277,19 +1362,31 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set VM all-frozen bit and flag.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
- Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ if (!visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* mark page all-frozen, and set VM all-frozen bit */
+ if (all_frozen)
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1408,6 +1505,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_pages,
stats->num_index_tuples,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
@@ -1782,7 +1880,8 @@ vac_cmp_itemptr(const void *left, const void *right)
* xmin amongst the visible tuples.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1791,6 +1890,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1814,7 +1914,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
@@ -1855,6 +1955,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1863,6 +1967,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 874ca6a..376841a 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -127,7 +127,7 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList)
if (attno != resultDesc->natts)
ereport(ERROR,
(errcode(ERRCODE_DATATYPE_MISMATCH),
- errmsg("table row type and query-specified row type do not match"),
+ errmsg("table row type and query-specified row type do not match"),
errdetail("Query has too few columns.")));
}
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 66dfef1..5898f1b 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -30,11 +30,14 @@
* If you add a new entry, remember to update the errhint in
* forkname_to_number() below, and update the SGML documentation for
* pg_relation_size().
+ * In 9.6 or later, the visibility map fork name is changed from "vm" to
+ * "vfm" because the visibility map has not only information about all-visible
+ * pages but also information about all-frozen pages.
*/
const char *const forkNames[] = {
"main", /* MAIN_FORKNUM */
"fsm", /* FSM_FORKNUM */
- "vm", /* VISIBILITYMAP_FORKNUM */
+ "vfm", /* VISIBILITYMAP_FORKNUM */
"init" /* INIT_FORKNUM */
};
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..93afb10 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..7270609 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,20 @@
#include "storage/buf.h"
#include "utils/relcache.h"
-extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
+extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern BlockNumber visibilitymap_count(Relation rel, uint8 flags);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
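With these extended signatures, every caller now names the bit(s) it cares
about. As a usage sketch (the helper below is hypothetical, not part of the
patch), a caller that wants to know whether the first nblocks of a relation
are all marked all-frozen could loop over visibilitymap_test() with
VISIBILITYMAP_ALL_FROZEN:

#include "postgres.h"

#include "access/visibilitymap.h"
#include "storage/bufmgr.h"

/* Hypothetical helper: are the first nblocks of rel marked all-frozen? */
static bool
rel_blocks_all_frozen(Relation rel, BlockNumber nblocks)
{
    Buffer      vmbuffer = InvalidBuffer;
    BlockNumber blkno;
    bool        result = true;

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        if (!visibilitymap_test(rel, blkno, &vmbuffer,
                                VISIBILITYMAP_ALL_FROZEN))
        {
            result = false;
            break;
        }
    }

    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);

    return result;
}

For an aggregate figure rather than a per-block test, visibilitymap_count(rel,
VISIBILITYMAP_ALL_FROZEN) reads the whole map instead.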
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 44ce2b3..1645add 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201507021
+#define CATALOG_VERSION_NO 201507101
#endif
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index e526cd9..ea0f7c1 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -95,7 +97,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 30
+#define Natts_pg_class 31
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -107,25 +109,26 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relispopulated 25
-#define Anum_pg_class_relreplident 26
-#define Anum_pg_class_relfrozenxid 27
-#define Anum_pg_class_relminmxid 28
-#define Anum_pg_class_relacl 29
-#define Anum_pg_class_reloptions 30
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relispopulated 26
+#define Anum_pg_class_relreplident 27
+#define Anum_pg_class_relfrozenxid 28
+#define Anum_pg_class_relminmxid 29
+#define Anum_pg_class_relacl 30
+#define Anum_pg_class_reloptions 31
/* ----------------
* initial contents of pg_class
@@ -140,13 +143,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 6fd1278..7d9b93f 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3213,6 +3213,12 @@ DESCR("sleep until the specified time");
DATA(insert OID = 2971 ( text PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 25 "16" _null_ _null_ _null_ _null_ _null_ booltext _null_ _null_ _null_ ));
DESCR("convert boolean to text");
+DATA(insert OID = 3298 ( pg_is_all_visible PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 16 "2205 20" _null_ _null_ _null_ _null_ _null_ pg_is_all_visible _null_ _null_ _null_ ));
+DESCR("true if the page is all visible");
+DATA(insert OID = 3299 ( pg_is_all_frozen PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 16 "2205 20" _null_ _null_ _null_ _null_ _null_ pg_is_all_frozen _null_ _null_ _null_ ));
+DESCR("true if the page is all frozen");
+
+
/* Aggregates (moved here from pg_aggregate for 7.3) */
DATA(insert OID = 2100 ( avg PGNSP PGUID 12 1 0 0 0 t f f f f f i 1 0 1700 "20" _null_ _null_ _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
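The heapfuncs.c file that should back these two entries is referenced from the
heap Makefile but is not included in the posted diff, which is also what the
build failure reported further down trips over. Just to make the intent
concrete, here is a minimal sketch of what pg_is_all_frozen(regclass, bigint)
could look like on top of the new visibilitymap_test() signature; the argument
handling and lock level are assumptions, not the author's code:

#include "postgres.h"

#include "access/heapam.h"
#include "access/visibilitymap.h"
#include "fmgr.h"
#include "storage/bufmgr.h"

Datum
pg_is_all_frozen(PG_FUNCTION_ARGS)
{
    Oid         relid = PG_GETARG_OID(0);
    int64       blkno = PG_GETARG_INT64(1);
    Relation    rel;
    Buffer      vmbuffer = InvalidBuffer;
    bool        result;

    rel = relation_open(relid, AccessShareLock);

    /* Test only the all-frozen bit for this block in the visibility map. */
    result = visibilitymap_test(rel, (BlockNumber) blkno, &vmbuffer,
                                VISIBILITYMAP_ALL_FROZEN);

    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);
    relation_close(rel, AccessShareLock);

    PG_RETURN_BOOL(result);
}

pg_is_all_visible would have the same shape, passing VISIBILITYMAP_ALL_VISIBLE
instead.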
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e3a31af..d2bae2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -172,6 +172,7 @@ extern void vac_update_relstats(Relation relation,
BlockNumber num_pages,
double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..7bf2718 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -369,6 +371,13 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+#define PageClearAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags &= ~PD_ALL_FROZEN)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
On Fri, Jul 10, 2015 at 3:42 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Jul 8, 2015 at 10:10 PM, Sawada Masahiko <sawada.mshk@gmail.com>
wrote:
On Thu, Jul 9, 2015 at 4:31 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Fri, Jul 3, 2015 at 1:25 AM, Sawada Masahiko <sawada.mshk@gmail.com>
wrote:
It's impossible to have VM bits set to frozen but not visible.
These bit are controlled independently. But eventually, when
all-frozen bit is set, all-visible is also set.
If that combination is currently impossible, could it be used indicate
that the page is all empty?
Yeah, the status of that VM bits set to frozen but not visible is
impossible, so we could use this status for another something status
of the page.
Having a crash-proof bitmap of all-empty pages would make vacuum
truncation scans much more efficient.
The empty page is always marked all-visible by vacuum today, it's not
enough?
The "current" vacuum can just remember that they were empty as well as
all-visible.
But the next vacuum that occurs on the table won't know that they are empty,
just that they are all-visible, so it can't truncate them away without
having to read each one first.
Yeah, it would be effective for vacuuming empty pages.
It is a minor thing, but if there is no other use for this fourth
"bit-space", it seems a shame to waste it when there is some use for it. I
haven't looked at the code around this area to know how hard it would be to
implement the setting and clearing of the bit.
I think so too, we would be able to use unused fourth status of bits
efficiently.
Should I include these improvement into this patch?
This topic should be discussed on another thread after this feature is
committed, I think.
Regards,
--
Sawada Masahiko
On 10 July 2015 at 09:49, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
It is a minor thing, but if there is no other use for this fourth
"bit-space", it seems a shame to waste it when there is some use forit. I
haven't looked at the code around this area to know how hard it would be
to
implement the setting and clearing of the bit.
I think so too, we would be able to use unused fourth status of bits
efficiently.
Should I include these improvement into this patch?
This topic should be discussed on another thread after this feature is
committed, I think.
The impossible state acts as a diagnostic check for us to ensure the bitmap
is not itself corrupt.
-1 for using it for another purpose.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
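[Editor's note: to make the "diagnostic check" point concrete, a sanity check based on the unused state could look roughly like the sketch below. This is hypothetical illustration, not code from the patch, and the names are invented for the example.]

/*
 * Hypothetical sanity check: because the patch only ever sets ALL_FROZEN
 * together with ALL_VISIBLE, observing frozen-without-visible in the map
 * would indicate that the bitmap itself is corrupt.
 */
#include <assert.h>
#include <stdint.h>

#define ALL_VISIBLE 0x01
#define ALL_FROZEN  0x02

static void
check_vm_flags(uint8_t flags)
{
    /* frozen implies visible; the reverse combination must never occur */
    assert(!((flags & ALL_FROZEN) && !(flags & ALL_VISIBLE)));
}

int
main(void)
{
    check_vm_flags(0);                          /* neither bit set: ok    */
    check_vm_flags(ALL_VISIBLE);                /* visible only: ok       */
    check_vm_flags(ALL_VISIBLE | ALL_FROZEN);   /* visible and frozen: ok */
    /* check_vm_flags(ALL_FROZEN); would trip the assertion */
    return 0;
}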
On Fri, Jul 10, 2015 at 2:41 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Fri, Jul 10, 2015 at 3:05 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:

Also, support for pg_upgrade is not done yet.
TODO
- Test case for this feature
- pg_upgrade support.

I had forgotten to change the fork name of the visibility map to "vfm".
Attached latest v7 patch.
Please review it.
The compilation failed on my machine...
gcc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -g -O0 -I../../../../src/include -D_GNU_SOURCE -c -o
visibilitymap.o visibilitymap.c
make[4]: *** No rule to make target `heapfuncs.o', needed by
`objfiles.txt'. Stop.
make[4]: *** Waiting for unfinished jobs....
( echo src/backend/access/index/genam.o
src/backend/access/index/indexam.o ) >objfiles.txt
make[4]: Leaving directory `/home/postgres/pgsql/git/src/backend/access/index'
gcc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -g -O0 -I../../../src/include -D_GNU_SOURCE -c -o
tablespace.o tablespace.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -g -O0 -I../../../src/include -D_GNU_SOURCE -c -o
instrument.o instrument.c
make[4]: Leaving directory `/home/postgres/pgsql/git/src/backend/access/heap'
make[3]: *** [heap-recursive] Error 2
make[3]: Leaving directory `/home/postgres/pgsql/git/src/backend/access'
make[2]: *** [access-recursive] Error 2
make[2]: *** Waiting for unfinished jobs....
Regards,
--
Fujii Masao
On Fri, Jul 10, 2015 at 10:43 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Jul 10, 2015 at 2:41 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Fri, Jul 10, 2015 at 3:05 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:

Also, support for pg_upgrade is not done yet.
TODO
- Test case for this feature
- pg_upgrade support.

I had forgotten to change the fork name of the visibility map to "vfm".
Attached latest v7 patch.
Please review it.

The compilation failed on my machine...
Oops, I had forgotten to add new file heapfuncs.c.
Latest patch is attached.
Regards,
--
Sawada Masahiko
Attachments:
000_add_frozen_bit_into_visibilitymap_v8.patch (application/octet-stream)
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/src/backend/access/heap/Makefile b/src/backend/access/heap/Makefile
index b83d496..806ce27 100644
--- a/src/backend/access/heap/Makefile
+++ b/src/backend/access/heap/Makefile
@@ -12,6 +12,7 @@ subdir = src/backend/access/heap
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o
+OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o \
+ heapfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 86a2e6b..ac74100 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -88,7 +88,7 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tup,
- bool all_visible_cleared, bool new_all_visible_cleared);
+ bool all_visible_cleared, bool new_all_visible_cleared);
static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
Bitmapset *hot_attrs,
Bitmapset *key_attrs, Bitmapset *id_attrs,
@@ -2131,8 +2131,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -2147,10 +2148,13 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllFrozen(BufferGetPage(buffer));
+
visibilitymap_clear(relation,
- ItemPointerGetBlockNumber(&(heaptup->t_self)),
- vmbuffer);
+ ItemPointerGetBlockNumber(&(heaptup->t_self)), vmbuffer);
}
/*
@@ -2448,10 +2452,12 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
- visibilitymap_clear(relation,
- BufferGetBlockNumber(buffer),
- vmbuffer);
+ PageClearAllFrozen(page);
+
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer);
}
/*
@@ -2495,7 +2501,9 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
/* the rest of the scratch space is used for tuple data */
tupledata = scratchptr;
- xlrec->flags = all_visible_cleared ? XLH_INSERT_ALL_VISIBLE_CLEARED : 0;
+ if (all_visible_cleared)
+ xlrec->flags = XLH_INSERT_ALL_VISIBLE_CLEARED;
+
xlrec->ntuples = nthispage;
/*
@@ -2731,9 +2739,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -2925,12 +2933,16 @@ l1:
*/
PageSetPrunable(page, xid);
+ /* clear PD_ALL_VISIBLE and PD_ALL_FROZEN flags */
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
- visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ PageClearAllFrozen(page);
+
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer);
}
/* store transaction information of xact deleting the tuple */
@@ -2961,7 +2973,9 @@ l1:
if (RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, &tp);
- xlrec.flags = all_visible_cleared ? XLH_DELETE_ALL_VISIBLE_CLEARED : 0;
+ if (all_visible_cleared)
+ xlrec.flags = XLH_DELETE_ALL_VISIBLE_CLEARED;
+
xlrec.infobits_set = compute_infobits(tp.t_data->t_infomask,
tp.t_data->t_infomask2);
xlrec.offnum = ItemPointerGetOffsetNumber(&tp.t_self);
@@ -3207,7 +3221,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -3801,16 +3815,22 @@ l2:
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(buffer));
- visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ PageClearAllFrozen(BufferGetPage(buffer));
+
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer);
}
if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
{
all_visible_cleared_new = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(newbuf));
- visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
- vmbuffer_new);
+ PageClearAllFrozen(BufferGetPage(newbuf));
+
+ visibilitymap_clear(relation, BufferGetBlockNumber(newbuf), vmbuffer_new);
}
if (newbuf != buffer)
@@ -6893,7 +6913,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -6903,6 +6923,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7492,8 +7513,14 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
+
}
else if (action == BLK_RESTORED)
{
@@ -7544,7 +7571,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
@@ -7694,7 +7721,10 @@ heap_xlog_delete(XLogReaderState *record)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
/* Make sure there is no forward chain link in t_ctid */
htup->t_ctid = target_tid;
@@ -7798,7 +7828,10 @@ heap_xlog_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
MarkBufferDirty(buffer);
}
@@ -7937,7 +7970,10 @@ heap_xlog_multi_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
MarkBufferDirty(buffer);
}
@@ -8065,7 +8101,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(obuffer);
@@ -8200,7 +8239,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
elog(PANIC, "heap_update_redo: failed to add tuple");
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/heap/heapfuncs.c b/src/backend/access/heap/heapfuncs.c
new file mode 100644
index 0000000..a0cc165
--- /dev/null
+++ b/src/backend/access/heap/heapfuncs.c
@@ -0,0 +1,80 @@
+/*-------------------------------------------------------------------------
+ *
+ * heapfuncs.c
+ * Functions for accessing the related heap page
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/heap/heapfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/visibilitymap.h"
+#include "funcapi.h"
+#include "storage/freespace.h"
+#include "storage/bufmgr.h"
+
+/* Functions for visibilitymap */
+extern Datum pg_is_all_visible(PG_FUNCTION_ARGS);
+extern Datum pg_is_all_frozen(PG_FUNCTION_ARGS);
+
+static bool visibilitymap_test_internal(Oid relid, int64 blkno, uint8 flag);
+
+/*
+ * Return the page is all-visible or not, according to the visibility map.
+ */
+Datum
+pg_is_all_visible(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 blkno = PG_GETARG_INT64(1);
+ bool all_visible;
+
+ all_visible = visibilitymap_test_internal(relid, blkno, VISIBILITYMAP_ALL_VISIBLE);
+
+ PG_RETURN_BOOL(all_visible);
+}
+
+/*
+ * Return the page is all-frozen or not, according to the visibility map.
+ */
+Datum
+pg_is_all_frozen(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 blkno = PG_GETARG_INT64(1);
+ bool all_frozen;
+
+ all_frozen = visibilitymap_test_internal(relid, blkno, VISIBILITYMAP_ALL_FROZEN);
+
+ PG_RETURN_BOOL(all_frozen);
+}
+
+static bool
+visibilitymap_test_internal(Oid relid, int64 blkno, uint8 flag)
+{
+
+ Relation rel;
+ Buffer vmbuffer = InvalidBuffer;
+ bool result;
+
+ rel = relation_open(relid, AccessShareLock);
+
+ if (blkno < 0 || blkno > MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid block number")));
+
+ result = visibilitymap_test(rel, blkno, &vmbuffer, flag);
+
+ ReleaseBuffer(vmbuffer);
+ relation_close(rel, AccessShareLock);
+
+ return result;
+}
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..13ad5b1 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,33 +21,44 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if a whole-table-scanning vacuum is required. The all-frozen bit must be set
+ * only when the page is already all-visible. That is, the all-frozen bit is
+ * always set together with the all-visible bit.
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. The all-frozen bit must be
+ * cleared together with the all-visible bit at the same time.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or frozen
+ * to all transactions; we just don't know that for certain. The difficulty is
+ * that there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit. If a
+ * crash occurs after the visibility map page makes it to disk and before the
+ * updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
+ * The visibility map is not used for anti-wraparound vacuums before 9.6, because
* an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
* present in the table, even on pages that don't have any dead tuples.
+ * In 9.6 or later, the visibility map has an additional bit which indicates that
+ * all tuples on a single page have been completely frozen, so the visibility map
+ * is also used for anti-wraparound vacuums.
+ *
*
* LOCKING
*
@@ -58,14 +69,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +112,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, other for all-frozen.
+*/
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +129,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set bits for visible and freeze */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +173,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,7 +185,8 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) <<
+ (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
@@ -225,7 +258,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +267,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and the flags which indicate which bits we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +279,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,7 +289,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -272,11 +307,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != ((map[mapByte] >> (BITS_PER_HEAPBLOCK * mapBit)) & flags))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +320,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +330,15 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(PageIsAllVisible(heapPage) || PageIsAllFrozen(heapPage));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +349,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test if bit(s) is set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must pass
+ * flags indicating which bits to test.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -328,7 +368,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +377,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +400,12 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single or double bit read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ?
+ true : false;
return result;
}
@@ -374,10 +415,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * The caller must pass flags indicating which bits to count.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, uint8 flags)
{
BlockNumber result = 0;
BlockNumber mapBlock;
@@ -406,7 +448,10 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ if (flags & VISIBILITYMAP_ALL_VISIBLE)
+ result += number_of_ones_for_visible[map[i]];
+ if (flags & VISIBILITYMAP_ALL_FROZEN)
+ result += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 4246554..015bfb8 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ {
+ relallvisible = visibilitymap_count(rel, VISIBILITYMAP_ALL_VISIBLE);
+ relallfrozen = visibilitymap_count(rel, VISIBILITYMAP_ALL_FROZEN);
+ }
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
if (rd_rel->relpages != (int32) relpages)
{
@@ -1940,6 +1947,11 @@ index_update_stats(Relation rel,
rd_rel->relallvisible = (int32) relallvisible;
dirty = true;
}
+ if (rd_rel->relallfrozen != (int32) relallfrozen)
+ {
+ rd_rel->relallfrozen = (int32) relallfrozen;
+ dirty = true;
+ }
}
/*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 861048f..392c2a4 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -572,7 +572,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN),
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -595,6 +596,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ab4874..d3725dd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -22,6 +22,7 @@
#include "access/rewriteheap.h"
#include "access/transam.h"
#include "access/tuptoaster.h"
+#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index baf66f1..d68c7c4 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -744,6 +744,7 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
bool in_outer_xact)
@@ -781,6 +782,11 @@ vac_update_relstats(Relation relation,
pgcform->relallvisible = (int32) num_all_visible_pages;
dirty = true;
}
+ if (pgcform->relallfrozen != (int32) num_all_frozen_pages)
+ {
+ pgcform->relallfrozen = (int32) num_all_frozen_pages;
+ dirty = true;
+ }
/* Apply DDL updates, but not inside an outer transaction (see above) */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..12322a4 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages skipped because the all-frozen
+ bit is set in the visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we may still be able to skip scanning some
+ * pages according to the all-frozen bit of the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,13 +258,16 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
}
else
+ {
scanned_all = true;
+ }
/*
* Optionally truncate the relation.
@@ -301,10 +309,14 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ new_rel_allfrozen = visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN);
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -312,6 +324,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_pages,
new_rel_tuples,
new_rel_allvisible,
+ new_rel_allfrozen,
vacrelstats->hasindex,
new_frozen_xid,
new_min_multi,
@@ -360,10 +373,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen page according to visibility map\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +500,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of the
+ * visibility map means that we can't update relfrozenxid, so we only want
+ * to do it if we can skip a goodly number. On the other hand, we count
+ * both how many pages we skipped according to the all-frozen bit of the
+ * visibility map and how many pages we froze, so we can still update
+ * relfrozenxid if the sum of them equals the number of pages in the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -515,7 +532,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -533,7 +551,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples we froze on this page */
+ int nalready_frozen; /* # of tuples found already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
bool all_visible;
@@ -548,7 +569,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -566,9 +588,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We check whether this block is also all-frozen, in which case we can
+ * skip vacuuming it even though a whole-table scan is required.
+ */
+ if (scan_all)
+ {
+ if (visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ }
+ else
+ {
+ if (skipping_all_visible_blocks)
+ continue;
+ }
+
all_visible_according_to_vm = true;
}
@@ -740,7 +778,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +803,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +959,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +977,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
+ * We must log the changes to be crash-safe against future truncation
+ * of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1013,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1038,47 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen in total */
+ if ((ntotal_frozen == ntup_per_page) &&
+ !visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,7 +1089,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
@@ -1047,6 +1118,17 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
visibilitymap_clear(onerel, blkno, vmbuffer);
}
+ /*
+ * If all tuples on this page turned out to be frozen as a result of the
+ * scan, set the all-frozen bit in both the page header and the visibility map.
+ */
+ if (ntotal_frozen == ntup_per_page)
+ {
+ PageSetAllFrozen(page);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer,
+ InvalidTransactionId, VISIBILITYMAP_ALL_FROZEN);
+ }
+
UnlockReleaseBuffer(buf);
/* Remember the location of the last page with nonremovable tuples */
@@ -1078,7 +1160,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map and frozen map page.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1126,6 +1208,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
nunused);
appendStringInfo(&buf, _("Skipped %u pages due to buffer pins.\n"),
vacrelstats->pinskipped_pages);
+ appendStringInfo(&buf, _("Skipped %u frozen pages according to visibility map.\n"),
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf, _("%u pages are entirely empty.\n"),
empty_pages);
appendStringInfo(&buf, _("%s."),
@@ -1226,6 +1310,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1277,19 +1362,31 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set VM all-frozen bit and flag.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
- Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ if (!visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* mark page all-frozen, and set VM all-frozen bit */
+ if (all_frozen)
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1408,6 +1505,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_pages,
stats->num_index_tuples,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
@@ -1782,7 +1880,8 @@ vac_cmp_itemptr(const void *left, const void *right)
* xmin amongst the visible tuples.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1791,6 +1890,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1814,7 +1914,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
@@ -1855,6 +1955,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1863,6 +1967,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 874ca6a..376841a 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -127,7 +127,7 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList)
if (attno != resultDesc->natts)
ereport(ERROR,
(errcode(ERRCODE_DATATYPE_MISMATCH),
- errmsg("table row type and query-specified row type do not match"),
+ errmsg("table row type and query-specified row type do not match"),
errdetail("Query has too few columns.")));
}
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 66dfef1..5898f1b 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -30,11 +30,14 @@
* If you add a new entry, remember to update the errhint in
* forkname_to_number() below, and update the SGML documentation for
* pg_relation_size().
+ * In 9.6 or later, the visibility map fork name is changed from "vm" to
+ * "vfm" because the visibility map carries not only all-visible but also
+ * all-frozen information.
*/
const char *const forkNames[] = {
"main", /* MAIN_FORKNUM */
"fsm", /* FSM_FORKNUM */
- "vm", /* VISIBILITYMAP_FORKNUM */
+ "vfm", /* VISIBILITYMAP_FORKNUM */
"init" /* INIT_FORKNUM */
};
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..93afb10 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..7270609 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,20 @@
#include "storage/buf.h"
#include "utils/relcache.h"
-extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
+extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern BlockNumber visibilitymap_count(Relation rel, uint8 flags);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index e526cd9..ea0f7c1 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -95,7 +97,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 30
+#define Natts_pg_class 31
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -107,25 +109,26 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relispopulated 25
-#define Anum_pg_class_relreplident 26
-#define Anum_pg_class_relfrozenxid 27
-#define Anum_pg_class_relminmxid 28
-#define Anum_pg_class_relacl 29
-#define Anum_pg_class_reloptions 30
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relispopulated 26
+#define Anum_pg_class_relreplident 27
+#define Anum_pg_class_relfrozenxid 28
+#define Anum_pg_class_relminmxid 29
+#define Anum_pg_class_relacl 30
+#define Anum_pg_class_reloptions 31
/* ----------------
* initial contents of pg_class
@@ -140,13 +143,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 6fd1278..7d9b93f 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3213,6 +3213,12 @@ DESCR("sleep until the specified time");
DATA(insert OID = 2971 ( text PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 25 "16" _null_ _null_ _null_ _null_ _null_ booltext _null_ _null_ _null_ ));
DESCR("convert boolean to text");
+DATA(insert OID = 3298 ( pg_is_all_visible PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 16 "2205 20" _null_ _null_ _null_ _null_ _null_ pg_is_all_visible _null_ _null_ _null_ ));
+DESCR("true if the page is all visible");
+DATA(insert OID = 3299 ( pg_is_all_frozen PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 16 "2205 20" _null_ _null_ _null_ _null_ _null_ pg_is_all_frozen _null_ _null_ _null_ ));
+DESCR("true if the page is all frozen");
+
+
/* Aggregates (moved here from pg_aggregate for 7.3) */
DATA(insert OID = 2100 ( avg PGNSP PGUID 12 1 0 0 0 t f f f f f i 1 0 1700 "20" _null_ _null_ _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e3a31af..d2bae2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -172,6 +172,7 @@ extern void vac_update_relstats(Relation relation,
BlockNumber num_pages,
double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..7bf2718 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -369,6 +371,13 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+#define PageClearAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags &= ~PD_ALL_FROZEN)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
On 7/10/15 4:46 AM, Simon Riggs wrote:
On 10 July 2015 at 09:49, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
It is a minor thing, but if there is no other use for this fourth
"bit-space", it seems a shame to waste it when there is some use for it. I
haven't looked at the code around this area to know how hard it would be to
implement the setting and clearing of the bit.
I think so too; we would be able to use the unused fourth bit state
efficiently.
Should I include these improvements in this patch?
This topic should be discussed on another thread after this feature is
committed, I think.
The impossible state acts as a diagnostic check for us to ensure the
bitmap is not itself corrupt. -1 for using it for another purpose.
AFAICS empty page is only interesting for vacuum truncate, which is a
very short-term thing. It would be better to find a way to handle that
differently.
In any case, that should definitely be a separate discussion from this
patch.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Jul 7, 2015 at 9:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 6 July 2015 at 17:28, Simon Riggs <simon@2ndquadrant.com> wrote:
I think we need something for pg_upgrade to rewrite existing VMs.
Otherwise a large read only database would suddenly require a massive
revacuum after upgrade, which seems bad. That can wait for now until we all
agree this patch is sound.
Since we need to rewrite the "vm" map, I think we should call the new map
"vfm". That way we will be able to easily check whether the rewrite has been
conducted on all relations. Since the maps are just bits there is no other
way to tell that a map has been rewritten.
To avoid a revacuum after upgrade, you meant that we need to rewrite each
bit of the vm to the corresponding bits of the vfm if the cluster comes from
a version that does not support the vfm (i.e., 9.5 or earlier), right?
If so, we would need to scan the whole table, which is just as expensive.
Clearing the vm and doing a revacuum would be better than doing this during
the upgrade, I think.
Regards,
--
Masahiko Sawada
On Mon, Jul 13, 2015 at 3:39 PM, Sawada Masahiko <sawada.mshk@gmail.com>
wrote:
On Tue, Jul 7, 2015 at 9:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 6 July 2015 at 17:28, Simon Riggs <simon@2ndquadrant.com> wrote:
I think we need something for pg_upgrade to rewrite existing VMs.
Otherwise a large read only database would suddenly require a massive
revacuum after upgrade, which seems bad. That can wait for now until
we all
agree this patch is sound.
Since we need to rewrite the "vm" map, I think we should call the new
map
"vfm"
That way we will be able to easily check whether the rewrite has been
conducted on all relations. Since the maps are just bits there is no other
way to tell that a map has been rewritten.
To avoid revacuum after upgrade, you meant that we need to rewrite
each bit of vm to corresponding bits of vfm, if it's from
not-supporting vfm version(i.g., 9.5 or earlier ). right?
If so, we will need to do whole scanning table, which is expensive as
well.
Clearing vm and do revacuum would be nice, rather than doing in
upgrading, I think.
How will you ensure a revacuum of all the tables after
upgrading? Until vacuum has been done on the tables that
had a vm before the upgrade, queries on those tables can
be slower.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jul 13, 2015 at 7:46 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Jul 13, 2015 at 3:39 PM, Sawada Masahiko <sawada.mshk@gmail.com>
wrote:
On Tue, Jul 7, 2015 at 9:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 6 July 2015 at 17:28, Simon Riggs <simon@2ndquadrant.com> wrote:
I think we need something for pg_upgrade to rewrite existing VMs.
Otherwise a large read only database would suddenly require a massive
revacuum after upgrade, which seems bad. That can wait for now until we
all
agree this patch is sound.
Since we need to rewrite the "vm" map, I think we should call the new map
"vfm". That way we will be able to easily check whether the rewrite has been
conducted on all relations. Since the maps are just bits there is no other
way to tell that a map has been rewritten.
To avoid revacuum after upgrade, you meant that we need to rewrite
each bit of vm to corresponding bits of vfm, if it's from
not-supporting vfm version(i.g., 9.5 or earlier ). right?
If so, we will need to do whole scanning table, which is expensive as
well.
Clearing vm and do revacuum would be nice, rather than doing in
upgrading, I think.
How will you ensure to have revacuum for all the tables after
upgrading?
We use the script files that are generated by pg_upgrade.
Till the time Vacuum is done on the tables that
have vm before upgrade, any queries on those tables can
become slower.
Even if we implement a rewriting tool for the vm in pg_upgrade, it will
take as much time as a revacuum because it needs to scan the whole table.
I meant that we rebuild the vm using an existing facility (i.e., VACUUM
(FREEZE)), instead of implementing a new rewriting tool for the vm.
Regards,
--
Masahiko Sawada
On 2015-07-13 21:03:07 +0900, Sawada Masahiko wrote:
Even If we implement rewriting tool for vm into pg_upgrade, it will
take time as much as revacuum because it need whole scanning table.
Why would it? Sure, you can only set allvisible and not the frozen bit,
but that's fine. That way the cost for freezing can be paid over time.
If we require terabytes of data to be scanned, including possibly
rewriting large portions due to freezing, before index only scans work
and most vacuums act in a partial manner the migration to 9.6 will be a
major pain for our users.
On Mon, Jul 13, 2015 at 9:03 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Mon, Jul 13, 2015 at 7:46 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Jul 13, 2015 at 3:39 PM, Sawada Masahiko <sawada.mshk@gmail.com>
wrote:
On Tue, Jul 7, 2015 at 9:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 6 July 2015 at 17:28, Simon Riggs <simon@2ndquadrant.com> wrote:
I think we need something for pg_upgrade to rewrite existing VMs.
Otherwise a large read only database would suddenly require a massive
revacuum after upgrade, which seems bad. That can wait for now until we all
agree this patch is sound.
Since we need to rewrite the "vm" map, I think we should call the new map
"vfm". That way we will be able to easily check whether the rewrite has been
conducted on all relations. Since the maps are just bits there is no other
way to tell that a map has been rewritten.
To avoid revacuum after upgrade, you meant that we need to rewrite
each bit of vm to corresponding bits of vfm, if it's from
not-supporting vfm version(i.g., 9.5 or earlier ). right?
If so, we will need to do whole scanning table, which is expensive as
well.
Clearing vm and do revacuum would be nice, rather than doing in
upgrading, I think.
How will you ensure to have revacuum for all the tables after
upgrading?
We use script file which are generated by pg_upgrade.
I haven't followed this thread closely, but I am sure you recall that
vacuumdb has a parallel mode.
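For reference, a cluster-wide freeze pass with that parallel mode could look
roughly like the following (an illustrative invocation only, not something
proposed in this thread; the job count is an arbitrary example):

    vacuumdb --all --freeze --jobs=4

That rebuilds the visibility/freeze information table by table, using several
connections per database in parallel.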
--
Michael
Michael Paquier wrote:
On Mon, Jul 13, 2015 at 9:03 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
We use script file which are generated by pg_upgrade.
I haven't followed this thread closely, but I am sure you recall that
vacuumdb has a parallel mode.
I think having to vacuum the whole database during pg_upgrade (or
immediately thereafter, which in practice means that the database is
unusable for queries until that has finished) is way too impractical.
Even in parallel mode, it could take far too long. People already
complain that our upgrading procedure takes too long as opposed to that
of other database systems.
I don't think there's any problem with rewriting the existing server's
VM file into "vfm" format during pg_upgrade, since we expect those files
to be much smaller than the data itself.
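As a rough illustration of why that rewrite is cheap: each old map byte
(eight heap blocks at one bit each) simply expands into two new map bytes
(four heap blocks at two bits each), with the old all-visible bit becoming
the low bit of each pair and the all-frozen bit left clear. A minimal sketch
of that inner loop follows; it is not part of the posted patch, it assumes
the LSB-first bit numbering and flag layout used by the v9 patch, and it
ignores the map page headers and fsync handling a real pg_upgrade step
would need.

#include <stdio.h>

/*
 * Spread one old-format map byte (8 heap blocks x 1 bit) into two
 * new-format bytes (4 heap blocks x 2 bits each).  The old all-visible
 * bit becomes the low bit of each 2-bit pair; the all-frozen bit is
 * left clear, so freeze status is rebuilt lazily by later vacuums.
 */
static void
rewrite_vm_byte(unsigned char old, unsigned char *out_lo, unsigned char *out_hi)
{
    unsigned char lo = 0, hi = 0;
    int     i;

    for (i = 0; i < 4; i++)
    {
        if (old & (1 << i))
            lo |= 1 << (2 * i);     /* heap blocks 0..3 of this byte */
        if (old & (1 << (i + 4)))
            hi |= 1 << (2 * i);     /* heap blocks 4..7 of this byte */
    }
    *out_lo = lo;
    *out_hi = hi;
}

int
main(void)
{
    unsigned char lo, hi;

    rewrite_vm_byte(0xFF, &lo, &hi);    /* all eight blocks all-visible */
    printf("0x%02X 0x%02X\n", lo, hi);  /* prints "0x55 0x55" */
    return 0;
}

Since the conversion only touches the map, its cost scales with the size of
the map file rather than the heap.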
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 13, 2015 at 9:22 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-07-13 21:03:07 +0900, Sawada Masahiko wrote:
Even If we implement rewriting tool for vm into pg_upgrade, it will
take time as much as revacuum because it need whole scanning table.
Why would it? Sure, you can only set allvisible and not the frozen bit,
but that's fine. That way the cost for freezing can be paid over time.
If we require terabytes of data to be scanned, including possibly
rewriting large portions due to freezing, before index only scans work
and most vacuums act in a partial manner the migration to 9.6 will be a
major pain for our users.
Ah, if we initialize every bit as not all-frozen, we don't need to scan the
whole table, only the vm.
And I agree with this.
But please imagine the case where the old cluster has a table which is very
large, read-only, and already frozen by VACUUM FREEZE.
In this case, the all-frozen bits of such a table in the new cluster will
not be set unless we run VACUUM FREEZE again.
The all-frozen information for such a table is lost.
Regards,
--
Masahiko Sawada
On 13 July 2015 at 15:48, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Mon, Jul 13, 2015 at 9:22 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-07-13 21:03:07 +0900, Sawada Masahiko wrote:
Even If we implement rewriting tool for vm into pg_upgrade, it will
take time as much as revacuum because it need whole scanning table.
Why would it? Sure, you can only set allvisible and not the frozen bit,
but that's fine. That way the cost for freezing can be paid over time.
If we require terabytes of data to be scanned, including possibly
rewriting large portions due to freezing, before index only scans work
and most vacuums act in a partial manner the migration to 9.6 will be a
major pain for our users.
Ah, If we set all bit as not all-frozen, we don't need to whole table
scanning, only scan vm.
And I agree with this.
But please image the case where old cluster has table which is very
large, read-only and vacuum freeze is done.
In this case, the all-frozen bit of such table in new cluster will not
set, unless we do vacuum freeze again.
The information of all-frozen of such table is lacked.
The contents of the VM fork are essential to retain after an upgrade because
they are used for Index Only Scans. If we destroy that information it could
send SQL response times to unacceptable levels after upgrade.
It takes time to scan the VM and create the new VFM, but the time taken is
proportional to the size of VM, which seems like it will be acceptable.
Example calcs:
An 8TB PostgreSQL installation would need us to scan 128MB of VM into about
256MB of VFM. Probably the fsyncs will occupy the most time.
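(Working through those numbers, assuming the default 8kB block size: 8TB of
heap is about 2^30 blocks; at one bit per block the existing VM is
2^30 / 8 bytes = 128MB, and at two bits per block the VFM comes to roughly
256MB.)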
In comparison, we would need to scan all 8TB to rebuild the VMs, which will
take much longer (and fsyncs will still be needed).
Since we don't record freeze map information now it is acceptable to begin
after upgrade with all freeze info set to zero.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-07-13 23:48:02 +0900, Sawada Masahiko wrote:
But please image the case where old cluster has table which is very
large, read-only and vacuum freeze is done.
In this case, the all-frozen bit of such table in new cluster will not
set, unless we do vacuum freeze again.
The information of all-frozen of such table is lacked.
So what? That's the situation today… Yes, it'll trigger an
anti-wraparound vacuum at some later point; after that the map bits
will be set.
On 10 July 2015 at 15:11, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
Oops, I had forgotten to add new file heapfuncs.c.
Latest patch is attached.
I think we've established the approach is desirable and defined the way
forwards for this, so this is looking good.
Some of my requests haven't been actioned yet, so I personally would not
commit this yet. I am happy to continue as reviewer/committer unless others
wish to take over.
The main missing item is pg_upgrade support, which won't happen by end of
CF1, so I am marking this as Returned With Feedback. Hopefully we can
review this again before CF2.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jul 15, 2015 at 12:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 10 July 2015 at 15:11, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
Oops, I had forgotten to add new file heapfuncs.c.
Latest patch is attached.
I think we've established the approach is desirable and defined the way
forwards for this, so this is looking good.
If we want to move stuff like pg_stattuple, pg_freespacemap into core,
we could move them into heapfuncs.c.
Some of my requests haven't been actioned yet, so I personally would not
commit this yet. I am happy to continue as reviewer/committer unless others
wish to take over.
The main missing item is pg_upgrade support, which won't happen by end of
CF1, so I am marking this as Returned With Feedback. Hopefully we can review
this again before CF2.
I appreciate your review.
Yeah, the pg_upgrade support and regression tests for the VFM patch are
almost done now; I will submit the patch this week after testing it.
Regards,
--
Masahiko Sawada
On Wed, Jul 15, 2015 at 3:07 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Wed, Jul 15, 2015 at 12:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 10 July 2015 at 15:11, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
Oops, I had forgotten to add new file heapfuncs.c.
Latest patch is attached.
I think we've established the approach is desirable and defined the way
forwards for this, so this is looking good.
If we want to move stuff like pg_stattuple, pg_freespacemap into core,
we could move them into heapfuncs.c.
Some of my requests haven't been actioned yet, so I personally would not
commit this yet. I am happy to continue as reviewer/committer unless others
wish to take over.
The main missing item is pg_upgrade support, which won't happen by end of
CF1, so I am marking this as Returned With Feedback. Hopefully we can review
this again before CF2.
I appreciate your reviewing.
Yeah, the pg_upgrade support and regression test for VFM patch is
almost done now, I will submit the patch in this week after testing it.
Attached is the latest v9 patch.
I added:
- regression test for visibility map (visibilitymap.sql and
visibilitymap.out files)
- pg_upgrade support (rewriting vm file to vfm file)
- regression test for pg_upgrade
Please review it.
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v9.patchtext/x-patch; charset=US-ASCII; name=000_add_frozen_bit_into_visibilitymap_v9.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/src/backend/access/heap/Makefile b/src/backend/access/heap/Makefile
index b83d496..806ce27 100644
--- a/src/backend/access/heap/Makefile
+++ b/src/backend/access/heap/Makefile
@@ -12,6 +12,7 @@ subdir = src/backend/access/heap
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o
+OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o \
+ heapfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 86a2e6b..796b76f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2131,8 +2131,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -2147,7 +2148,11 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllFrozen(BufferGetPage(buffer));
+
visibilitymap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
vmbuffer);
@@ -2448,7 +2453,11 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+
visibilitymap_clear(relation,
BufferGetBlockNumber(buffer),
vmbuffer);
@@ -2731,9 +2740,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -2925,10 +2934,15 @@ l1:
*/
PageSetPrunable(page, xid);
+ /* clear PD_ALL_VISIBLE and PD_ALL_FROZEN flags */
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
vmbuffer);
}
@@ -3207,7 +3221,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -3801,14 +3815,22 @@ l2:
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllFrozen(BufferGetPage(buffer));
+
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
vmbuffer);
}
if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
{
all_visible_cleared_new = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(newbuf));
+ PageClearAllFrozen(BufferGetPage(newbuf));
+
visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
vmbuffer_new);
}
@@ -6893,7 +6915,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -6903,6 +6925,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7492,8 +7515,14 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
+
}
else if (action == BLK_RESTORED)
{
@@ -7544,7 +7573,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
@@ -7694,7 +7723,10 @@ heap_xlog_delete(XLogReaderState *record)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
/* Make sure there is no forward chain link in t_ctid */
htup->t_ctid = target_tid;
@@ -7798,7 +7830,10 @@ heap_xlog_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
MarkBufferDirty(buffer);
}
@@ -7937,7 +7972,10 @@ heap_xlog_multi_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
MarkBufferDirty(buffer);
}
@@ -8065,7 +8103,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(obuffer);
@@ -8200,7 +8241,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
elog(PANIC, "heap_update_redo: failed to add tuple");
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/heap/heapfuncs.c b/src/backend/access/heap/heapfuncs.c
new file mode 100644
index 0000000..6c3753b
--- /dev/null
+++ b/src/backend/access/heap/heapfuncs.c
@@ -0,0 +1,81 @@
+/*-------------------------------------------------------------------------
+ *
+ * heapfuncs.c
+ * Functions for accessing the related heap page
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/heap/heapfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/visibilitymap.h"
+#include "funcapi.h"
+#include "storage/freespace.h"
+#include "storage/bufmgr.h"
+
+/* Functions for visibilitymap */
+extern Datum pg_is_all_visible(PG_FUNCTION_ARGS);
+extern Datum pg_is_all_frozen(PG_FUNCTION_ARGS);
+
+static bool visibilitymap_test_internal(Oid relid, uint64 blkno, uint8);
+
+/*
+ * Return the page is all-visible or not, according to the visibility map.
+ */
+Datum
+pg_is_all_visible(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 blkno = PG_GETARG_INT64(1);
+ bool all_visible;
+
+ all_visible = visibilitymap_test_internal(relid, blkno, VISIBILITYMAP_ALL_VISIBLE);
+
+ PG_RETURN_BOOL(all_visible);
+}
+
+/*
+ * Return the page is all-frozen or not, according to the visibility map.
+ */
+Datum
+pg_is_all_frozen(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 blkno = PG_GETARG_INT64(1);
+ bool all_frozen;
+
+ all_frozen = visibilitymap_test_internal(relid, blkno, VISIBILITYMAP_ALL_FROZEN);
+
+ PG_RETURN_BOOL(all_frozen);
+}
+
+static bool
+visibilitymap_test_internal(Oid relid, uint64 blkno, uint8 flag)
+{
+
+ Relation rel;
+ Buffer vmbuffer = InvalidBuffer;
+ bool result;
+
+ rel = relation_open(relid, AccessShareLock);
+
+ if (blkno < 0 || blkno > MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid block number")));
+
+ result = visibilitymap_test(rel, blkno, &vmbuffer, flag);
+
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
+ relation_close(rel, AccessShareLock);
+
+ return result;
+}
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..a284b85 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,33 +21,45 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if whole table scanning vacuum is required (e.g. anti-wraparound vacuum).
+ * An all-frozen bit must be set only when the page is already all-visible.
+ * That is, the all-frozen bit is always set together with the all-visible bit.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. And the all-frozen bit must be
+ * cleared at the same time as the all-visible bit.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or frozen
+ * to all transactions; we just don't know that for certain. The difficulty is
+ * that there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit. If a
+ * crash occurs after the visibility map page makes it to disk and before the
+ * updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
+ * The visibility map is not used for anti-wraparound vacuums before 9.5, because
* an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
* present in the table, even on pages that don't have any dead tuples.
+ * In 9.6 or later, the visibility map has an additional bit which indicates that
+ * all tuples on a single page have been completely frozen, so the visibility map
+ * is also used for anti-wraparound vacuums.
+ *
*
* LOCKING
*
@@ -58,14 +70,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +113,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +130,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set bits for visible and freeze */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +174,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,7 +186,8 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) <<
+ (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
@@ -225,7 +259,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +268,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and flags which indicates what flag we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +280,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,7 +290,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -272,11 +308,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +321,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +331,15 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(PageIsAllVisible(heapPage) || PageIsAllFrozen(heapPage));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +350,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test if bit(s) is set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must set flags
+ * which indicates what flag we want to test.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -328,7 +369,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +378,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +401,12 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single or double bit read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ?
+ true : false;
return result;
}
@@ -374,10 +416,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * The caller must set the flags which indicates what flag we want to count.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, uint8 flags)
{
BlockNumber result = 0;
BlockNumber mapBlock;
@@ -406,7 +449,10 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ if (flags & VISIBILITYMAP_ALL_VISIBLE)
+ result += number_of_ones_for_visible[map[i]];
+ if (flags & VISIBILITYMAP_ALL_FROZEN)
+ result += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 4246554..015bfb8 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ {
+ relallvisible = visibilitymap_count(rel, VISIBILITYMAP_ALL_VISIBLE);
+ relallfrozen = visibilitymap_count(rel, VISIBILITYMAP_ALL_FROZEN);
+ }
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
if (rd_rel->relpages != (int32) relpages)
{
@@ -1940,6 +1947,11 @@ index_update_stats(Relation rel,
rd_rel->relallvisible = (int32) relallvisible;
dirty = true;
}
+ if (rd_rel->relallfrozen != (int32) relallfrozen)
+ {
+ rd_rel->relallfrozen = (int32) relallfrozen;
+ dirty = true;
+ }
}
/*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 861048f..392c2a4 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -572,7 +572,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN),
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -595,6 +596,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ab4874..d3725dd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -22,6 +22,7 @@
#include "access/rewriteheap.h"
#include "access/transam.h"
#include "access/tuptoaster.h"
+#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index baf66f1..d68c7c4 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -744,6 +744,7 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
bool in_outer_xact)
@@ -781,6 +782,11 @@ vac_update_relstats(Relation relation,
pgcform->relallvisible = (int32) num_all_visible_pages;
dirty = true;
}
+ if (pgcform->relallfrozen != (int32) num_all_frozen_pages)
+ {
+ pgcform->relallfrozen = (int32) num_all_frozen_pages;
+ dirty = true;
+ }
/* Apply DDL updates, but not inside an outer transaction (see above) */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..120de63 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped by all-frozen bit
+ of visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we may still skip some pages
+ * according to the all-frozen bit of the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +258,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -301,10 +307,14 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ new_rel_allfrozen = visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN);
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -312,6 +322,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_pages,
new_rel_tuples,
new_rel_allvisible,
+ new_rel_allfrozen,
vacrelstats->hasindex,
new_frozen_xid,
new_min_multi,
@@ -360,10 +371,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +498,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * the visibility map means that we can't update relfrozenxid, so we only
+ * want to do it if we can skip a goodly number. On the other hand, we
+ * count both how many pages we skipped according to the all-frozen bit of
+ * the visibility map and how many pages we froze, so we can update
+ * relfrozenxid if the sum of them equals the number of pages in the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -515,7 +530,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -533,7 +549,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples we froze on this page */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
bool all_visible;
@@ -548,7 +567,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -566,9 +586,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We check whether this block is all-frozen so that we can skip
+ * vacuuming this page even when a whole-table scan is required.
+ */
+ if (scan_all)
+ {
+ if (visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ }
+ else
+ {
+ if (skipping_all_visible_blocks)
+ continue;
+ }
+
all_visible_according_to_vm = true;
}
@@ -740,7 +776,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +801,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +957,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +975,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
+ * We must log the changes to be crash-safe against future truncation
+ * of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1011,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1036,47 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen in total */
+ if ((ntotal_frozen == ntup_per_page) &&
+ !visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,7 +1087,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
@@ -1047,6 +1116,17 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
visibilitymap_clear(onerel, blkno, vmbuffer);
}
+ /*
+ * As a result of scanning the page, set the VM all-frozen bit and the
+ * page header flag if all tuples on the page are frozen.
+ */
+ if (ntotal_frozen == ntup_per_page)
+ {
+ PageSetAllFrozen(page);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer,
+ InvalidTransactionId, VISIBILITYMAP_ALL_FROZEN);
+ }
+
UnlockReleaseBuffer(buf);
/* Remember the location of the last page with nonremovable tuples */
@@ -1078,7 +1158,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map and frozen map page.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1115,6 +1195,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
tups_vacuumed, vacuumed_pages)));
/*
+ * This information is useful for seeing how much the all-frozen bit of
+ * the VM helped to avoid freezing work.
+ */
+ ereport(elevel,
+ (errmsg("Skipped %d frozen pages acoording to visibility map",
+ vacrelstats->vmskipped_frozen_pages)));
+
+ /*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
*/
@@ -1226,6 +1314,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1277,19 +1366,31 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set VM all-frozen bit and flag.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
- Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ if (!visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* mark page all-frozen, and set VM all-frozen bit */
+ if (all_frozen)
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1408,6 +1509,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_pages,
stats->num_index_tuples,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
@@ -1782,7 +1884,8 @@ vac_cmp_itemptr(const void *left, const void *right)
* xmin amongst the visible tuples.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1791,6 +1894,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1814,7 +1918,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
@@ -1855,6 +1959,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1863,6 +1971,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 874ca6a..376841a 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -127,7 +127,7 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList)
if (attno != resultDesc->natts)
ereport(ERROR,
(errcode(ERRCODE_DATATYPE_MISMATCH),
- errmsg("table row type and query-specified row type do not match"),
+ errmsg("table row type and query-specified row type do not match"),
errdetail("Query has too few columns.")));
}
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 79d9390..8fededc 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
#include <fcntl.h>
@@ -21,6 +22,27 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static int rewrite_vm_to_vfm(const char *fromfile, const char *tofile, bool force);
+
+/* table for fast rewriting vm file to vfm file */
+static const uint16 rewrite_vm_to_vfm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
/*
* copyAndUpdateFile()
@@ -30,11 +52,19 @@ static int win32_pghardlink(const char *src, const char *dst);
*/
const char *
copyAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst, bool force)
+ const char *src, const char *dst, bool force, bool rewrite_vm)
{
+
if (pageConverter == NULL)
{
- if (pg_copy_file(src, dst, force) == -1)
+ int ret;
+
+ if (rewrite_vm)
+ ret = rewrite_vm_to_vfm(src, dst, force);
+ else
+ ret = pg_copy_file(src, dst, force);
+
+ if (ret)
return getErrorText(errno);
else
return NULL;
@@ -99,7 +129,6 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
}
}
-
/*
* linkAndUpdateFile()
*
@@ -201,6 +230,110 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewrite_vm_to_vfm()
+ *
+ * An additional bit indicating that all tuples on a page are completely
+ * frozen was added to the visibility map in PG 9.6, so the format of the
+ * visibility map has changed.
+ * Copies a visibility map file while adding a cleared all-frozen bit next to each all-visible bit.
+ */
+static int
+rewrite_vm_to_vfm(const char *fromfile, const char *tofile, bool force)
+{
+#define REWRITE_BUF_SIZE (50 * BLCKSZ)
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd, dst_fd;
+ uint16 vfm_bits;
+ ssize_t nbytes;
+ char *buffer;
+ int ret = 0;
+ int save_errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ {
+ errno = EINVAL;
+ return -1;
+ }
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ return -1;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ {
+ save_errno = errno;
+ if (src_fd != 0)
+ close(src_fd);
+
+ errno = save_errno;
+ return -1;
+ }
+
+ buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
+
+ /* Copy page header data in advance */
+ if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
+ {
+ save_errno = errno;
+ return -1;
+ }
+
+ if (write(dst_fd, buffer, nbytes) != nbytes)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ save_errno = errno;
+ return -1;
+ }
+
+ /* perform data rewriting, i.e., read from source, write to destination */
+ while (true)
+ {
+ ssize_t nbytes = read(src_fd, buffer, REWRITE_BUF_SIZE);
+ char *cur, *end;
+
+ if (nbytes < 0)
+ {
+ ret = -1;
+ break;
+ }
+
+ if (nbytes == 0)
+ break;
+
+ cur = buffer;
+ end = buffer + nbytes;
+
+ /*
+ * Rewrite one source byte at a time and write BITS_PER_HEAPBLOCK bytes to dst_fd.
+ */
+ while (end > cur)
+ {
+ /* Get the rewritten bits from the lookup table */
+ vfm_bits = rewrite_vm_to_vfm_table[(uint8) *cur];
+
+ if (write(dst_fd, &vfm_bits, BITS_PER_HEAPBLOCK) != BITS_PER_HEAPBLOCK)
+ {
+ ret = -1;
+ break;
+ }
+ cur++;
+ }
+ }
+
+ pg_free(buffer);
+
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return ret;
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 13aa891..d957581 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -112,6 +112,11 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map changed with this 9.6 commit.
+ *
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201507161
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -397,7 +402,7 @@ typedef void *pageCnvCtx;
#endif
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst, bool force);
+ const char *dst, bool force, bool rewrite_vm);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c22df42..766a473 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *type_old_suffix, const char *type_new_suffix);
/*
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_rewrite_needed = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,13 +181,20 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite "vm" to "vfm"?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_rewrite_needed = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(pageConverter, &maps[mapnum], "", "");
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +202,17 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(pageConverter, &maps[mapnum], "_fsm", "_fsm");
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ {
+ /*
+ * vm file is changed to vfm file in PG 9.6.
+ */
+ if (vm_rewrite_needed)
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", "_vfm");
+ else
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", "_vm");
+ }
}
}
}
@@ -210,7 +226,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
*/
static void
transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+ const char *type_old_suffix, const char *type_new_suffix)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -218,6 +234,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
int fd;
int segno;
char extent_suffix[65];
+ bool rewrite_vm = false;
/*
* Now copy/link any related segments as well. Remember, PG breaks large
@@ -236,18 +253,18 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
map->old_tablespace_suffix,
map->old_db_oid,
map->old_relfilenode,
- type_suffix,
+ type_old_suffix,
extent_suffix);
snprintf(new_file, sizeof(new_file), "%s%s/%u/%u%s%s",
map->new_tablespace,
map->new_tablespace_suffix,
map->new_db_oid,
map->new_relfilenode,
- type_suffix,
+ type_new_suffix,
extent_suffix);
/* Is it an extent, fsm, or vm file? */
- if (type_suffix[0] != '\0' || segno != 0)
+ if (type_old_suffix[0] != '\0' || segno != 0)
{
/* Did file open fail? */
if ((fd = open(old_file, O_RDONLY, 0)) == -1)
@@ -276,7 +293,11 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /* We need to rewrite vm file to vfm file. */
+ if (strcmp(type_old_suffix, type_new_suffix) != 0)
+ rewrite_vm = true;
+
+ if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true, rewrite_vm)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index f4e5d9a..53b8b2f 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -171,6 +171,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -185,6 +190,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -200,6 +213,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -211,11 +226,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 66dfef1..5898f1b 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -30,11 +30,14 @@
* If you add a new entry, remember to update the errhint in
* forkname_to_number() below, and update the SGML documentation for
* pg_relation_size().
+ * In 9.6 or later, the visibility map fork name is changed from "vm" to
+ * "vfm" because the visibility map carries not only all-visible
+ * information but also all-frozen information.
*/
const char *const forkNames[] = {
"main", /* MAIN_FORKNUM */
"fsm", /* FSM_FORKNUM */
- "vm", /* VISIBILITYMAP_FORKNUM */
+ "vfm", /* VISIBILITYMAP_FORKNUM */
"init" /* INIT_FORKNUM */
};
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..93afb10 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..7270609 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,20 @@
#include "storage/buf.h"
#include "utils/relcache.h"
-extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
+extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern BlockNumber visibilitymap_count(Relation rel, uint8 flags);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 44ce2b3..734df9d 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201507021
+#define CATALOG_VERSION_NO 201507161
#endif
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index e526cd9..ea0f7c1 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -95,7 +97,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 30
+#define Natts_pg_class 31
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -107,25 +109,26 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relispopulated 25
-#define Anum_pg_class_relreplident 26
-#define Anum_pg_class_relfrozenxid 27
-#define Anum_pg_class_relminmxid 28
-#define Anum_pg_class_relacl 29
-#define Anum_pg_class_reloptions 30
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relispopulated 26
+#define Anum_pg_class_relreplident 27
+#define Anum_pg_class_relfrozenxid 28
+#define Anum_pg_class_relminmxid 29
+#define Anum_pg_class_relacl 30
+#define Anum_pg_class_reloptions 31
/* ----------------
* initial contents of pg_class
@@ -140,13 +143,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 6fd1278..7d9b93f 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3213,6 +3213,12 @@ DESCR("sleep until the specified time");
DATA(insert OID = 2971 ( text PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 25 "16" _null_ _null_ _null_ _null_ _null_ booltext _null_ _null_ _null_ ));
DESCR("convert boolean to text");
+DATA(insert OID = 3298 ( pg_is_all_visible PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 16 "2205 20" _null_ _null_ _null_ _null_ _null_ pg_is_all_visible _null_ _null_ _null_ ));
+DESCR("true if the page is all visible");
+DATA(insert OID = 3299 ( pg_is_all_frozen PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 16 "2205 20" _null_ _null_ _null_ _null_ _null_ pg_is_all_frozen _null_ _null_ _null_ ));
+DESCR("true if the page is all frozen");
+
+
/* Aggregates (moved here from pg_aggregate for 7.3) */
DATA(insert OID = 2100 ( avg PGNSP PGUID 12 1 0 0 0 t f f f f f i 1 0 1700 "20" _null_ _null_ _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e3a31af..d2bae2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -172,6 +172,7 @@ extern void vac_update_relstats(Relation relation,
BlockNumber num_pages,
double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..7bf2718 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -369,6 +371,13 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+#define PageClearAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags &= ~PD_ALL_FROZEN)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..543eeaa
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,75 @@
+--
+-- Visibility map
+--
+CREATE FUNCTION
+ pg_visibilitymap(rel regclass, blkno OUT bigint, all_visible OUT bool, all_frozen OUT bool)
+RETURNS SETOF RECORD
+AS $$
+ SELECT blkno, pg_is_all_visible($1, blkno) AS all_visible, pg_is_all_frozen($1, blkno) AS all_frozen
+ FROM generate_series(0, pg_relation_size($1) / current_setting('block_size')::bigint - 1) AS blkno;
+$$
+LANGUAGE SQL;
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+\set VERBOSITY terse
+-- All pages become all-visible
+VACUUM vmtest;
+SELECT count(all_visible) = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_visibilitymap('vmtest')
+ WHERE all_visible;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_class
+ WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+VACUUM FREEZE vmtest;
+SELECT count(all_visible) = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_visibilitymap('vmtest')
+ WHERE all_visible
+ GROUP BY all_visible;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT count(all_frozen) = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_visibilitymap('vmtest')
+ WHERE all_frozen
+ GROUP BY all_frozen;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_class
+ WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT relallfrozen = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_class
+ WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- All pages are skipped according to VM
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: Skipped 45 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+DROP FUNCTION pg_visibilitymap(regclass);
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 4df15de..893d773 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -108,5 +108,8 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare without_oid c
# event triggers cannot run concurrently with any test that runs DDL
test: event_trigger
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
+
# run stats by itself because its delay may be insufficient under heavy load
test: stats
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 3a607cf..76dbff7 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -157,3 +157,4 @@ test: xml
test: event_trigger
test: stats
test: tablesample
+test: visibilitymap
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..11b552e
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,49 @@
+--
+-- Visibility map
+--
+
+CREATE FUNCTION
+ pg_visibilitymap(rel regclass, blkno OUT bigint, all_visible OUT bool, all_frozen OUT bool)
+RETURNS SETOF RECORD
+AS $$
+ SELECT blkno, pg_is_all_visible($1, blkno) AS all_visible, pg_is_all_frozen($1, blkno) AS all_frozen
+ FROM generate_series(0, pg_relation_size($1) / current_setting('block_size')::bigint - 1) AS blkno;
+$$
+LANGUAGE SQL;
+
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+\set VERBOSITY terse
+
+-- All pages become all-visible
+VACUUM vmtest;
+SELECT count(all_visible) = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_visibilitymap('vmtest')
+ WHERE all_visible;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_class
+ WHERE relname = 'vmtest';
+
+VACUUM FREEZE vmtest;
+SELECT count(all_visible) = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_visibilitymap('vmtest')
+ WHERE all_visible
+ GROUP BY all_visible;
+SELECT count(all_frozen) = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_visibilitymap('vmtest')
+ WHERE all_frozen
+ GROUP BY all_frozen;
+
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_class
+ WHERE relname = 'vmtest';
+SELECT relallfrozen = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_class
+ WHERE relname = 'vmtest';
+
+-- All pages are skipped according to VM
+VACUUM FREEZE VERBOSE vmtest;
+
+DROP FUNCTION pg_visibilitymap(regclass);
+DROP TABLE vmtest;
On Thu, Jul 16, 2015 at 8:51 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Wed, Jul 15, 2015 at 3:07 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Wed, Jul 15, 2015 at 12:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 10 July 2015 at 15:11, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
Oops, I had forgotten to add new file heapfuncs.c.
Latest patch is attached.
I think we've established the approach is desirable and defined the way
forwards for this, so this is looking good.
If we want to move stuff like pg_stattuple, pg_freespacemap into core,
we could move them into heapfuncs.c.
Some of my requests haven't been actioned yet, so I personally would not
commit this yet. I am happy to continue as reviewer/committer unless others
wish to take over.
The main missing item is pg_upgrade support, which won't happen by end of
CF1, so I am marking this as Returned With Feedback. Hopefully we can review
this again before CF2.
I appreciate your reviewing.
Yeah, the pg_upgrade support and regression test for the VFM patch are
almost done now; I will submit the patch this week after testing it.
Attached is the latest v9 patch.
I added:
- regression test for visibility map (visibilitymap.sql and
visibilitymap.out files)
- pg_upgrade support (rewriting the vm file to a vfm file; a rough sketch of the byte-level rewrite follows this list)
- regression test for pg_upgrade
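To make the pg_upgrade part easier to review, here is a rough, self-contained sketch (not code from the patch; the helper name expand_vm_byte is made up here) of how one old-format visibility map byte, which holds eight all-visible bits, expands to the new two-bits-per-heap-block format with every all-frozen bit left clear. The patch itself does the same thing with the precomputed 256-entry rewrite_vm_to_vfm_table.

/*
 * Illustrative sketch only: expand one old-format VM byte (eight
 * all-visible bits, one per heap block) into the new format, which
 * uses two bits per heap block.  Bit 2*i of the result is the
 * all-visible bit for heap block i; bit 2*i+1 is the all-frozen bit,
 * which an upgrade leaves clear.  The values produced match the
 * entries of rewrite_vm_to_vfm_table in the patch.
 */
#include <stdint.h>

static uint16_t
expand_vm_byte(uint8_t old_byte)
{
	uint16_t	new_bits = 0;
	int			i;

	for (i = 0; i < 8; i++)
	{
		if (old_byte & (1 << i))
			new_bits |= (uint16_t) 1 << (2 * i);	/* all-visible bit of pair i */
		/* the all-frozen bit (2 * i + 1) stays 0 after an upgrade */
	}
	return new_bits;
}

For example, expand_vm_byte(0x03) returns 0x0005: the first two heap blocks keep their all-visible bits, and neither is marked all-frozen.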
The previous patch failed to apply cleanly, so I have attached a rebased patch.
The catalog version is not decided yet, so we will need to rewrite
VISIBILITY_MAP_FROZEN_BIT_CAT_VER in pg_upgrade.h.
Please review it.
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v10.patchtext/x-patch; charset=US-ASCII; name=000_add_frozen_bit_into_visibilitymap_v10.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/src/backend/access/heap/Makefile b/src/backend/access/heap/Makefile
index b83d496..806ce27 100644
--- a/src/backend/access/heap/Makefile
+++ b/src/backend/access/heap/Makefile
@@ -12,6 +12,7 @@ subdir = src/backend/access/heap
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o
+OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o \
+ heapfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 050efdc..2dbabc8 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2176,8 +2176,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -2192,7 +2193,11 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllFrozen(BufferGetPage(buffer));
+
visibilitymap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
vmbuffer);
@@ -2493,7 +2498,11 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+
visibilitymap_clear(relation,
BufferGetBlockNumber(buffer),
vmbuffer);
@@ -2776,9 +2785,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -2970,10 +2979,15 @@ l1:
*/
PageSetPrunable(page, xid);
+ /* clear PD_ALL_VISIBLE and PD_ALL_FROZEN flags */
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
vmbuffer);
}
@@ -3252,7 +3266,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -3846,14 +3860,22 @@ l2:
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllFrozen(BufferGetPage(buffer));
+
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
vmbuffer);
}
if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
{
all_visible_cleared_new = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(newbuf));
+ PageClearAllFrozen(BufferGetPage(newbuf));
+
visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
vmbuffer_new);
}
@@ -6938,7 +6960,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -6948,6 +6970,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7537,8 +7560,14 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
+
}
else if (action == BLK_RESTORED)
{
@@ -7589,7 +7618,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
@@ -7739,7 +7768,10 @@ heap_xlog_delete(XLogReaderState *record)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
/* Make sure there is no forward chain link in t_ctid */
htup->t_ctid = target_tid;
@@ -7843,7 +7875,10 @@ heap_xlog_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
MarkBufferDirty(buffer);
}
@@ -7982,7 +8017,10 @@ heap_xlog_multi_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
MarkBufferDirty(buffer);
}
@@ -8110,7 +8148,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(obuffer);
@@ -8245,7 +8286,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
elog(PANIC, "heap_update_redo: failed to add tuple");
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..a284b85 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,33 +21,45 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if a whole-table-scanning vacuum is required (e.g. an anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible.
+ * That is, the all-frozen bit is always set together with the all-visible bit.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. The all-frozen bit must be
+ * cleared together with the all-visible bit.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or frozen
+ * to all transactions; we just don't know that for certain. The difficulty is
+ * that there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit. If a
+ * crash occurs after the visibility map page makes it to disk and before the
+ * updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
+ * The visibility map is not used for anti-wraparound vacuums before 9.6, because
* an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
* present in the table, even on pages that don't have any dead tuples.
+ * In 9.6 or later, the visibility map has an additional bit which indicates that
+ * all tuples on a single page have been completely frozen, so the visibility map is
+ * also used for anti-wraparound vacuums.
+ *
*
* LOCKING
*
@@ -58,14 +70,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +113,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, other for all-frozen.
+*/
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +130,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set bits for visible and freeze */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +174,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,7 +186,8 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) <<
+ (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
@@ -225,7 +259,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +268,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and the flags indicating which bit(s) we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +280,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,7 +290,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -272,11 +308,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +321,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +331,15 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(PageIsAllVisible(heapPage) || PageIsAllFrozen(heapPage));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +350,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test if bit(s) is set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits. The caller must
+ * also pass flags indicating which bit(s) to test.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -328,7 +369,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +378,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +401,12 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single- or double-bit read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ?
+ true : false;
return result;
}
@@ -374,10 +416,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * The caller must pass flags indicating which bit(s) to count.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, uint8 flags)
{
BlockNumber result = 0;
BlockNumber mapBlock;
@@ -406,7 +449,10 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ if (flags & VISIBILITYMAP_ALL_VISIBLE)
+ result += number_of_ones_for_visible[map[i]];
+ if (flags & VISIBILITYMAP_ALL_FROZEN)
+ result += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
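
(Aside, not part of the patch itself: with this change the visibility map stores
BITS_PER_HEAPBLOCK = 2 bits per heap block, so the all-visible and all-frozen flags
for one block sit next to each other inside a map byte. The short standalone C sketch
below only illustrates the bit addressing that the new visibilitymap_set()/
visibilitymap_test() expressions rely on; the three #defines mirror the patch, while
mapbit_for_block() is an illustrative stand-in for the real HEAPBLK_TO_MAPBIT macro.)

#include <stdint.h>
#include <stdbool.h>

/* Constants as introduced by the patch */
#define VISIBILITYMAP_ALL_VISIBLE 0x01
#define VISIBILITYMAP_ALL_FROZEN  0x02
#define BITS_PER_HEAPBLOCK        2

/* With 2 bits per heap block, one map byte now covers 4 heap blocks. */
#define HEAPBLOCKS_PER_BYTE       (8 / BITS_PER_HEAPBLOCK)

/* Slot of a heap block within its map byte (illustrative stand-in for the
 * HEAPBLK_TO_MAPBIT macro). */
int
mapbit_for_block(uint32_t heapBlk)
{
    return heapBlk % HEAPBLOCKS_PER_BYTE;
}

/* Test whether the requested flag bit(s) are set for the heap block that
 * occupies slot mapBit (0..3) within a map byte, mirroring the expression
 * used in visibilitymap_test(). */
bool
vm_flags_set(uint8_t map_byte, int mapBit, uint8_t flags)
{
    return (map_byte & (flags << (BITS_PER_HEAPBLOCK * mapBit))) != 0;
}

So testing VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN, as lazy_scan_heap()
does when deciding which blocks it may skip, returns true as soon as either bit for
that block is set.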
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 69f35c9..87bf0c8 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ {
+ relallvisible = visibilitymap_count(rel, VISIBILITYMAP_ALL_VISIBLE);
+ relallfrozen = visibilitymap_count(rel, VISIBILITYMAP_ALL_FROZEN);
+ }
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
if (rd_rel->relpages != (int32) relpages)
{
@@ -1940,6 +1947,11 @@ index_update_stats(Relation rel,
rd_rel->relallvisible = (int32) relallvisible;
dirty = true;
}
+ if (rd_rel->relallfrozen != (int32) relallfrozen)
+ {
+ rd_rel->relallfrozen = (int32) relallfrozen;
+ dirty = true;
+ }
}
/*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 861048f..392c2a4 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -572,7 +572,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN),
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -595,6 +596,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ab4874..d3725dd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -22,6 +22,7 @@
#include "access/rewriteheap.h"
#include "access/transam.h"
#include "access/tuptoaster.h"
+#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index baf66f1..d68c7c4 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -744,6 +744,7 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
bool in_outer_xact)
@@ -781,6 +782,11 @@ vac_update_relstats(Relation relation,
pgcform->relallvisible = (int32) num_all_visible_pages;
dirty = true;
}
+ if (pgcform->relallfrozen != (int32) num_all_frozen_pages)
+ {
+ pgcform->relallfrozen = (int32) num_all_frozen_pages;
+ dirty = true;
+ }
/* Apply DDL updates, but not inside an outer transaction (see above) */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..120de63 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped due to the
+ all-frozen bit of the visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we may still skip scanning pages whose
+ * all-frozen bit is already set in the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +258,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -301,10 +307,14 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ new_rel_allfrozen = visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN);
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -312,6 +322,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_pages,
new_rel_tuples,
new_rel_allvisible,
+ new_rel_allfrozen,
vacrelstats->hasindex,
new_frozen_xid,
new_min_multi,
@@ -360,10 +371,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +498,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * the visibility map means that we can't update relfrozenxid, so we only
+ * want to do it if we can skip a goodly number of pages. On the other hand,
+ * we count both the pages we skipped according to the all-frozen bit and
+ * the pages we froze ourselves, so we can still update relfrozenxid if
+ * their sum covers every page of the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -515,7 +530,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -533,7 +549,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples frozen on this page */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page; /* # of tuples remaining on this page */
Size freespace;
bool all_visible_according_to_vm;
bool all_visible;
@@ -548,7 +567,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -566,9 +586,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We also check whether it is all-frozen, in which case we can skip
+ * vacuuming it even when a whole-table scan is required.
+ */
+ if (scan_all)
+ {
+ if (visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ }
+ else
+ {
+ if (skipping_all_visible_blocks)
+ continue;
+ }
+
all_visible_according_to_vm = true;
}
@@ -740,7 +776,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +801,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +957,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +975,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
+ * We must log the changes to be crash-safe against future truncation
+ * of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1011,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the total number of frozen tuples on this page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1036,47 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen in total */
+ if ((ntotal_frozen == ntup_per_page) &&
+ !visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,7 +1087,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
@@ -1047,6 +1116,17 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
visibilitymap_clear(onerel, blkno, vmbuffer);
}
+ /*
+ * If, as a result of scanning this page, every tuple on it is now frozen,
+ * set the all-frozen flag in both the page header and the VM.
+ */
+ if (ntotal_frozen == ntup_per_page)
+ {
+ PageSetAllFrozen(page);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer,
+ InvalidTransactionId, VISIBILITYMAP_ALL_FROZEN);
+ }
+
UnlockReleaseBuffer(buf);
/* Remember the location of the last page with nonremovable tuples */
@@ -1078,7 +1158,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map and frozen map page.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1115,6 +1195,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
tups_vacuumed, vacuumed_pages)));
/*
+ * This information shows how effective the all-frozen bit of the VM was
+ * at letting us skip pages that are already fully frozen.
+ */
+ ereport(elevel,
+ (errmsg("Skipped %d frozen pages acoording to visibility map",
+ vacrelstats->vmskipped_frozen_pages)));
+
+ /*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
*/
@@ -1226,6 +1314,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1277,19 +1366,31 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set the page-level flag and the VM all-frozen bit.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
- Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ if (!visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* mark page all-frozen, and set VM all-frozen bit */
+ if (all_frozen)
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1408,6 +1509,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_pages,
stats->num_index_tuples,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
@@ -1782,7 +1884,8 @@ vac_cmp_itemptr(const void *left, const void *right)
* xmin amongst the visible tuples.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1791,6 +1894,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1814,7 +1918,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
@@ -1855,6 +1959,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1863,6 +1971,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1ef76d0..ee49ddf 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -127,7 +127,7 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList)
if (attno != resultDesc->natts)
ereport(ERROR,
(errcode(ERRCODE_DATATYPE_MISMATCH),
- errmsg("table row type and query-specified row type do not match"),
+ errmsg("table row type and query-specified row type do not match"),
errdetail("Query has too few columns.")));
}
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 79d9390..8fededc 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
#include <fcntl.h>
@@ -21,6 +22,27 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static int rewrite_vm_to_vfm(const char *fromfile, const char *tofile, bool force);
+
+/* Lookup table for fast rewriting of a vm file into a vfm file */
+static const uint16 rewrite_vm_to_vfm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
/*
* copyAndUpdateFile()
@@ -30,11 +52,19 @@ static int win32_pghardlink(const char *src, const char *dst);
*/
const char *
copyAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst, bool force)
+ const char *src, const char *dst, bool force, bool rewrite_vm)
{
+
if (pageConverter == NULL)
{
- if (pg_copy_file(src, dst, force) == -1)
+ int ret;
+
+ if (rewrite_vm)
+ ret = rewrite_vm_to_vfm(src, dst, force);
+ else
+ ret = pg_copy_file(src, dst, force);
+
+ if (ret)
return getErrorText(errno);
else
return NULL;
@@ -99,7 +129,6 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
}
}
-
/*
* linkAndUpdateFile()
*
@@ -201,6 +230,110 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewrite_vm_to_vfm()
+ *
+ * An additional bit indicating that all tuples on a page are completely
+ * frozen was added to the visibility map in PG 9.6, so the format of the
+ * visibility map has changed. Copies a visibility map file while inserting
+ * an all-frozen bit (initially 0) after each all-visible bit.
+ */
+static int
+rewrite_vm_to_vfm(const char *fromfile, const char *tofile, bool force)
+{
+#define REWRITE_BUF_SIZE (50 * BLCKSZ)
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd, dst_fd;
+ uint16 vfm_bits;
+ ssize_t nbytes;
+ char *buffer;
+ int ret = 0;
+ int save_errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ {
+ errno = EINVAL;
+ return -1;
+ }
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ return -1;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ {
+ save_errno = errno;
+ if (src_fd != 0)
+ close(src_fd);
+
+ errno = save_errno;
+ return -1;
+ }
+
+ buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
+
+ /* Copy page header data in advance */
+ if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
+ {
+ save_errno = errno;
+ return -1;
+ }
+
+ if (write(dst_fd, buffer, nbytes) != nbytes)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ save_errno = errno;
+ return -1;
+ }
+
+ /* Perform the data rewriting, i.e. read from the source, write to the destination */
+ while (true)
+ {
+ ssize_t nbytes = read(src_fd, buffer, REWRITE_BUF_SIZE);
+ char *cur, *end;
+
+ if (nbytes < 0)
+ {
+ ret = -1;
+ break;
+ }
+
+ if (nbytes == 0)
+ break;
+
+ cur = buffer;
+ end = buffer + nbytes;
+
+ /*
+ * Expand each source byte into BITS_PER_HEAPBLOCK bytes and write them to dst_fd.
+ */
+ while (end > cur)
+ {
+ /* Look up the expanded bit pattern for this source byte */
+ vfm_bits = rewrite_vm_to_vfm_table[(uint8) *cur];
+
+ if (write(dst_fd, &vfm_bits, BITS_PER_HEAPBLOCK) != BITS_PER_HEAPBLOCK)
+ {
+ ret = -1;
+ break;
+ }
+ cur++;
+ }
+ }
+
+ pg_free(buffer);
+
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return ret;
+}
+
void
check_hard_link(void)
{
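
(Aside, not part of the patch: the 256-entry rewrite_vm_to_vfm_table above can be read
as follows: each all-visible bit i of an old-format vm byte moves to bit position 2*i of
a two-byte new-format value, and every interleaved all-frozen bit is left at 0. A minimal
sketch of how such a table could be generated, shown only to make the mapping explicit:)

#include <stdint.h>
#include <stdio.h>

/* Expand one old-format vm byte (8 heap blocks x 1 bit) into the new
 * format (8 heap blocks x 2 bits): old bit i moves to bit 2*i, and the
 * new all-frozen bits stay 0. */
uint16_t
expand_vm_byte(uint8_t old)
{
    uint16_t out = 0;
    int      i;

    for (i = 0; i < 8; i++)
        if (old & (1 << i))
            out |= (uint16_t) 1 << (2 * i);
    return out;
}

int
main(void)
{
    int b;

    /* Prints the same 256 values as rewrite_vm_to_vfm_table, 16 per row. */
    for (b = 0; b < 256; b++)
        printf("%u%s", expand_vm_byte((uint8_t) b),
               (b % 16 == 15) ? ",\n" : ", ");
    return 0;
}

For example, expand_vm_byte(0xFF) yields 21845 (0x5555), matching the last table entry.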
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 13aa891..d957581 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -112,6 +112,11 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of visibility map changed with this 9.6 commit,
+ *
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201507161
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -397,7 +402,7 @@ typedef void *pageCnvCtx;
#endif
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst, bool force);
+ const char *dst, bool force, bool rewrite_vm);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c22df42..766a473 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *type_old_suffix, const char *type_new_suffix);
/*
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_rewrite_needed = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,13 +181,20 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite "vm" to "vfm"?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_rewrite_needed = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(pageConverter, &maps[mapnum], "", "");
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +202,17 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(pageConverter, &maps[mapnum], "_fsm", "_fsm");
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ {
+ /*
+ * The vm file is renamed to vfm in PG 9.6.
+ */
+ if (vm_rewrite_needed)
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", "_vfm");
+ else
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", "_vm");
+ }
}
}
}
@@ -210,7 +226,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
*/
static void
transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+ const char *type_old_suffix, const char *type_new_suffix)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -218,6 +234,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
int fd;
int segno;
char extent_suffix[65];
+ bool rewrite_vm = false;
/*
* Now copy/link any related segments as well. Remember, PG breaks large
@@ -236,18 +253,18 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
map->old_tablespace_suffix,
map->old_db_oid,
map->old_relfilenode,
- type_suffix,
+ type_old_suffix,
extent_suffix);
snprintf(new_file, sizeof(new_file), "%s%s/%u/%u%s%s",
map->new_tablespace,
map->new_tablespace_suffix,
map->new_db_oid,
map->new_relfilenode,
- type_suffix,
+ type_new_suffix,
extent_suffix);
/* Is it an extent, fsm, or vm file? */
- if (type_suffix[0] != '\0' || segno != 0)
+ if (type_old_suffix[0] != '\0' || segno != 0)
{
/* Did file open fail? */
if ((fd = open(old_file, O_RDONLY, 0)) == -1)
@@ -276,7 +293,11 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /* We need to rewrite vm file to vfm file. */
+ if (strcmp(type_old_suffix, type_new_suffix) != 0)
+ rewrite_vm = true;
+
+ if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true, rewrite_vm)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index f4e5d9a..53b8b2f 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -171,6 +171,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -185,6 +190,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -200,6 +213,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -211,11 +226,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 66dfef1..5898f1b 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -30,11 +30,14 @@
* If you add a new entry, remember to update the errhint in
* forkname_to_number() below, and update the SGML documentation for
* pg_relation_size().
+ * In 9.6 or later, the visibility map fork name is changed from "vm" to
+ * "vfm" because the visibility map holds not only all-visible information
+ * but also all-frozen information.
*/
const char *const forkNames[] = {
"main", /* MAIN_FORKNUM */
"fsm", /* FSM_FORKNUM */
- "vm", /* VISIBILITYMAP_FORKNUM */
+ "vfm", /* VISIBILITYMAP_FORKNUM */
"init" /* INIT_FORKNUM */
};
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..93afb10 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..7270609 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,20 @@
#include "storage/buf.h"
#include "utils/relcache.h"
-extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
+extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern BlockNumber visibilitymap_count(Relation rel, uint8 flags);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index e526cd9..ea0f7c1 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -95,7 +97,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 30
+#define Natts_pg_class 31
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -107,25 +109,26 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relispopulated 25
-#define Anum_pg_class_relreplident 26
-#define Anum_pg_class_relfrozenxid 27
-#define Anum_pg_class_relminmxid 28
-#define Anum_pg_class_relacl 29
-#define Anum_pg_class_reloptions 30
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relispopulated 26
+#define Anum_pg_class_relreplident 27
+#define Anum_pg_class_relfrozenxid 28
+#define Anum_pg_class_relminmxid 29
+#define Anum_pg_class_relacl 30
+#define Anum_pg_class_reloptions 31
/* ----------------
* initial contents of pg_class
@@ -140,13 +143,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 09bf143..dbe16f3 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3213,6 +3213,12 @@ DESCR("sleep until the specified time");
DATA(insert OID = 2971 ( text PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 25 "16" _null_ _null_ _null_ _null_ _null_ booltext _null_ _null_ _null_ ));
DESCR("convert boolean to text");
+DATA(insert OID = 3298 ( pg_is_all_visible PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 16 "2205 20" _null_ _null_ _null_ _null_ _null_ pg_is_all_visible _null_ _null_ _null_ ));
+DESCR("true if the page is all visible");
+DATA(insert OID = 3299 ( pg_is_all_frozen PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 16 "2205 20" _null_ _null_ _null_ _null_ _null_ pg_is_all_frozen _null_ _null_ _null_ ));
+DESCR("true if the page is all frozen");
+
+
/* Aggregates (moved here from pg_aggregate for 7.3) */
DATA(insert OID = 2100 ( avg PGNSP PGUID 12 1 0 0 0 t f f f f f i 1 0 1700 "20" _null_ _null_ _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
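
(Aside, not part of the patch text quoted here: the pg_proc.h entries above only declare
the two diagnostic builtins; their C bodies appear elsewhere in the patch. Purely as an
illustration of the kind of wrapper such a builtin amounts to, here is a hedged sketch of
pg_is_all_frozen() calling visibilitymap_test() with the new flags argument. The argument
handling, locking, and lack of bounds checking are assumptions for the sketch, not the
author's actual implementation.)

#include "postgres.h"

#include "access/heapam.h"          /* relation_open()/relation_close() */
#include "access/visibilitymap.h"
#include "fmgr.h"
#include "storage/bufmgr.h"

/* Hypothetical sketch only. */
Datum
pg_is_all_frozen(PG_FUNCTION_ARGS)
{
    Oid         relid = PG_GETARG_OID(0);
    int64       blkno = PG_GETARG_INT64(1);
    Relation    rel;
    Buffer      vmbuf = InvalidBuffer;
    bool        result;

    rel = relation_open(relid, AccessShareLock);

    /* Ask the visibility map whether the block's all-frozen bit is set. */
    result = visibilitymap_test(rel, (BlockNumber) blkno, &vmbuf,
                                VISIBILITYMAP_ALL_FROZEN);

    if (BufferIsValid(vmbuf))
        ReleaseBuffer(vmbuf);
    relation_close(rel, AccessShareLock);

    PG_RETURN_BOOL(result);
}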
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e3a31af..d2bae2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -172,6 +172,7 @@ extern void vac_update_relstats(Relation relation,
BlockNumber num_pages,
double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..7bf2718 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -369,6 +371,13 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+#define PageClearAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags &= ~PD_ALL_FROZEN)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 4df15de..893d773 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -108,5 +108,8 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare without_oid c
# event triggers cannot run concurrently with any test that runs DDL
test: event_trigger
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
+
# run stats by itself because its delay may be insufficient under heavy load
test: stats
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 15d74d4..da84aa6 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -157,3 +157,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
\ No newline at end of file
On Wed, Jul 8, 2015 at 02:31:04PM +0100, Simon Riggs wrote:
On 7 July 2015 at 18:45, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Wed, Jul 8, 2015 at 12:37 AM, Andres Freund <andres@anarazel.de> wrote:
On 2015-07-07 16:25:13 +0100, Simon Riggs wrote:
I don't think pg_freespacemap is the right place.
I agree that pg_freespacemap sounds like an odd location.
I'd prefer to add that as a single function into core, so we can write
formal tests.
With the advent of src/test/modules it's not really a prerequisite for
things to be builtin to be testable. I think there are fair arguments for
moving stuff like pg_stattuple, pg_freespacemap, pg_buffercache into
core at some point, but that's probably a separate discussion.
I understood.
So I will place a bunch of tests like src/test/module/visibilitymap_test,
which contains some tests regarding this feature,
and gather them into one patch.
Please place it in core. I see value in having a diagnostic function for
general use on production systems.
Sorry to be coming to this discussion late.
I understand the desire for a diagnostic function in core, but we have
to be consistent. Just because we are adding this function now doesn't
mean we should use different rules from what we did previously for
diagnostic functions. Either there is logic to why this function is
different from the other diagnostic functions in contrib, or we need to
have a separate discussion of whether diagnostic functions belong in
contrib or core.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Bruce Momjian wrote:
I understand the desire for a diagnostic function in core, but we have
to be consistent. Just because we are adding this function now doesn't
mean we should use different rules from what we did previously for
diagnostic functions. Either there is logic to why this function is
different from the other diagnostic functions in contrib, or we need to
have a separate discussion of whether diagnostic functions belong in
contrib or core.
Then let's start moving some extensions to src/extension/.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Aug 5, 2015 at 12:36 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Bruce Momjian wrote:
I understand the desire for a diagnostic function in core, but we have
to be consistent. Just because we are adding this function now doesn't
mean we should use different rules from what we did previously for
diagnostic functions. Either there is logic to why this function is
different from the other diagnostic functions in contrib, or we need to
have a separate discussion of whether diagnostic functions belong in
contrib or core.
Then let's start moving some extensions to src/extension.
That seems like yet another separate issue.
FWIW, it seems to me that we've done a heck of a lot of moving stuff
out of contrib over the last few releases. A bunch of things moved to
src/test/modules and a bunch of things went to src/bin. We can move
more, of course, but this code reorganization has non-trivial costs
and I'm not clear what benefits we hope to realize and whether we are
in fact realizing those benefits. At this point, the overwhelming
majority of what's in contrib is extensions; we're not far from being
able to put the whole thing in src/extensions if it really needs to be
moved at all.
But I don't think it's fair to conflate that with Bruce's question,
which it seems to me is both a fair question and a different one.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Robert Haas wrote:
On Wed, Aug 5, 2015 at 12:36 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Bruce Momjian wrote:
I understand the desire for a diagnostic function in core, but we have
to be consistent. Just because we are adding this function now doesn't
mean we should use different rules from what we did previously for
diagnostic functions. Either there is logic to why this function is
different from the other diagnostic functions in contrib, or we need to
have a separate discussion of whether diagnostic functions belong in
contrib or core.
Then let's start moving some extensions to src/extension.
That seems like yet another separate issue.
FWIW, it seems to me that we've done a heck of a lot of moving stuff
out of contrib over the last few releases. A bunch of things moved to
src/test/modules and a bunch of things went to src/bin. We can move
more, of course, but this code reorganization has non-trivial costs
and I'm not clear what benefits we hope to realize and whether we are
in fact realizing those benefits. At this point, the overwhelming
majority of what's in contrib is extensions; we're not far from being
able to put the whole thing in src/extensions if it really needs to be
moved at all.
There are a number of things in contrib that are not extensions, and
others are not core-quality yet. I don't think we should move
everything; at least not everything in one go. I think there are a
small number of diagnostic extensions that would be useful to have in
core (pageinspect, pg_buffercache, pg_stat_statements).
But I don't think it's fair to conflate that with Bruce's question,
which it seems to me is both a fair question and a different one.
Well, there was no question as such. If the question is "should we
instead put it in contrib just to be consistent?" then I think the
answer is no. I value consistency as much as every other person, but
there are other things I value more, such as availability. If stuff is
in contrib and servers don't have it installed because of package
policies and it takes three management layers' approval to get it
installed in a dying server, then I prefer to have it in core.
If the question was "why are we not using the rule we previously had
that diagnostic tools were in contrib?" then I think the answer is that
we have evolved and we now know better. We have evolved in the sense
that we have more stuff in production now that needs better diagnostic
tooling to be available; and we know better now in the sense that we
have realized there's this company policy bureaucracy that things in
contrib are not always available for reasons that are beyond us.
Anyway, the patch as proposed puts the new functions in core as builtins
(which is what Bruce seems to be objecting to). Maybe instead of
proposing moving existing extensions in core, it would be better to have
this patch put those two new functions alone as a single new extension
in src/extension, and not move anything else. I don't necessarily
resist adding these functions as builtins, but if we do that then
there's no going back to having them as an extension instead, which is
presumably more in line with what we want in the long run.
(It would be a shame to delay this patch, which messes with complex
innards, just because of a discussion about the placement of two
smallish diagnostic functions.)
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 08/05/2015 10:00 AM, Alvaro Herrera wrote:
Anyway, the patch as proposed puts the new functions in core as builtins
(which is what Bruce seems to be objecting to). Maybe instead of
proposing moving existing extensions in core, it would be better to have
this patch put those two new functions alone as a single new extension
in src/extension, and not move anything else. I don't necessarily
resist adding these functions as builtins, but if we do that then
there's no going back to having them as an extension instead, which is
presumably more in line with what we want in the long run.
For my part, I am unclear on why we are putting *any* diagnostic tools
in /contrib today. Either the diagnostic tools are good quality and
necessary for a bunch of users, in which case we ship them in core, or
they are obscure and/or untested, in which case they go in an external
project and/or on PGXN.
Yes, for tools with overhead we might want to require enabling them in
pg.conf. But that's very different from requiring the user to install a
separate package.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Aug 5, 2015 at 10:22:48AM -0700, Josh Berkus wrote:
On 08/05/2015 10:00 AM, Alvaro Herrera wrote:
Anyway, the patch as proposed puts the new functions in core as builtins
(which is what Bruce seems to be objecting to). Maybe instead of
proposing moving existing extensions in core, it would be better to have
this patch put those two new functions alone as a single new extension
in src/extension, and not move anything else. I don't necessarily
resist adding these functions as builtins, but if we do that then
there's no going back to having them as an extension instead, which is
presumably more in line with what we want in the long run.
For my part, I am unclear on why we are putting *any* diagnostic tools
in /contrib today. Either the diagnostic tools are good quality and
necessary for a bunch of users, in which case we ship them in core, or
they are obscure and/or untested, in which case they go in an external
project and/or on PGXN.
Yes, for tools with overhead we might want to require enabling them in
pg.conf. But that's very different from requiring the user to install a
separate package.
I don't care what we do, but I do think we should be consistent.
Frankly I am unclear why I am even having to make this point, as cases
where we have chosen expediency over consistency have served us badly in
the past.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 08/05/2015 10:26 AM, Bruce Momjian wrote:
On Wed, Aug 5, 2015 at 10:22:48AM -0700, Josh Berkus wrote:
On 08/05/2015 10:00 AM, Alvaro Herrera wrote:
Anyway, the patch as proposed puts the new functions in core as builtins
(which is what Bruce seems to be objecting to). Maybe instead of
proposing moving existing extensions in core, it would be better to have
this patch put those two new functions alone as a single new extension
in src/extension, and not move anything else. I don't necessarily
resist adding these functions as builtins, but if we do that then
there's no going back to having them as an extension instead, which is
presumably more in line with what we want in the long run.
For my part, I am unclear on why we are putting *any* diagnostic tools
in /contrib today. Either the diagnostic tools are good quality and
necessary for a bunch of users, in which case we ship them in core, or
they are obscure and/or untested, in which case they go in an external
project and/or on PGXN.
Yes, for tools with overhead we might want to require enabling them in
pg.conf. But that's very different from requiring the user to install a
separate package.
I don't care what we do, but I do think we should be consistent.
Frankly I am unclear why I am even having to make this point, as cases
where we have chosen expediency over consistency have served us badly in
the past.
Saying "it's stupid to be consistent with a bad old rule", and making a
new rule is not "expediency".
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Josh Berkus wrote:
On 08/05/2015 10:26 AM, Bruce Momjian wrote:
I don't care what we do, but I do think we should be consistent.
Frankly I am unclear why I am even having to make this point, as cases
where we have chosen expediency over consistency have served us badly in
the past.
Saying "it's stupid to be consistent with a bad old rule", and making a
new rule is not "expediency".
So I discussed this with Bruce on IM a bit. I think there are basically
four ways we could go about this:
1. Add the functions as builtins.
This is what the current patch does. Simon seems to prefer this,
because he wants the function to be always available in production;
but I don't like this option because adding functions as builtins
makes it impossible to move later to extensions.
Bruce doesn't like this option either.
2. Add the functions to contrib, keep them there for the foreseeable future.
Simon is against this option, because the functions will be
unavailable when needed in production. I am of the same position.
Bruce opines this option is acceptable.
3. a) Add the function to some extension in contrib now, by using a
slightly modified version of the current patch, and
b) Apply some later patch to move said extension to src/extension.
4. a) Patch some extension(s) to move it to src/extension,
b) Apply a version of this patch that adds the new functions to said
extension
Essentially 3 and 4 are the same thing except the order is reversed;
they both result in the functions being shipped in some "core extension"
(a concept we do not have today). Bruce says either of these is fine
with him. I am fine with either of them also. As long as we do 3b
during the 9.6 timeframe, the outcome of either 3 or 4 seems to be
acceptable to Simon also.
Robert seems to be saying that he doesn't care about moving extensions
to core at all.
What do others think?
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 08/05/2015 10:46 AM, Alvaro Herrera wrote:
1. Add the functions as a builtins.
This is what the current patch does. Simon seems to prefer this,
because he wants the function to be always available in production;
but I don't like this option because adding functions as builtins
makes it impossible to move later to extensions.
Bruce doesn't like this option either.
Why would we want to move them later to extensions? Do you anticipate
not needing them in the future? If we don't need them in the future,
why would they continue to exist at all?
I'm really not getting this.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus wrote:
On 08/05/2015 10:46 AM, Alvaro Herrera wrote:
1. Add the functions as a builtins.
This is what the current patch does. Simon seems to prefer this,
because he wants the function to be always available in production;
but I don't like this option because adding functions as builtins
makes it impossible to move later to extensions.
Bruce doesn't like this option either.
Why would we want to move them later to extensions?
Because it's not nice to have random stuff as builtins.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-08-05 20:09, Alvaro Herrera wrote:
Josh Berkus wrote:
On 08/05/2015 10:46 AM, Alvaro Herrera wrote:
1. Add the functions as a builtins.
This is what the current patch does. Simon seems to prefer this,
because he wants the function to be always available in production;
but I don't like this option because adding functions as builtins
makes it impossible to move later to extensions.
Bruce doesn't like this option either.
Why would we want to move them later to extensions?
Because it's not nice to have random stuff as builtins.
Extensions have one nice property: they provide namespacing, so not
everything has to be in pg_catalog, which already has about a gazillion
functions. It's nice to have stuff you don't need for day-to-day
operations separate but still available (which is why src/extensions is
better than contrib).
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Aug 5, 2015 at 10:58:00AM -0700, Josh Berkus wrote:
On 08/05/2015 10:46 AM, Alvaro Herrera wrote:
1. Add the functions as a builtins.
This is what the current patch does. Simon seems to prefer this,
because he wants the function to be always available in production;
but I don't like this option because adding functions as builtins
makes it impossible to move later to extensions.
Bruce doesn't like this option either.
Why would we want to move them later to extensions? Do you anticipate
not needing them in the future? If we don't need them in the future,
why would they continue to exist at all?
I'm really not getting this.
----------------------------
This is why I suggested putting the new SQL function where it belongs
for consistency and then open a separate thread to discuss the future of
where we want diagnostic functions to be. It is too complicated to talk
about both issues in the same thread.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
Bruce Momjian wrote:
This is why I suggested putting the new SQL function where it belongs
for consistency and then open a separate thread to discuss the future of
where we want diagnostic functions to be. It is too complicated to talk
about both issues in the same thread.
Oh come on -- gimme a break. We figure out much more complicated
problems in single threads all the time.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Aug 5, 2015 at 11:57:48PM -0300, Alvaro Herrera wrote:
Bruce Momjian wrote:
This is why I suggested putting the new SQL function where it belongs
for consistency and then open a separate thread to discuss the future of
where we want diagnostic functions to be. It is too complicated to talk
about both issues in the same thread.
Oh come on -- gimme a break. We figure out much more complicated
problems in single threads all the time.
Well, people are confused, as stated --- what more can I say?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 8/5/15 1:47 PM, Petr Jelinek wrote:
On 2015-08-05 20:09, Alvaro Herrera wrote:
Josh Berkus wrote:
On 08/05/2015 10:46 AM, Alvaro Herrera wrote:
1. Add the functions as a builtins.
This is what the current patch does. Simon seems to prefer this,
because he wants the function to be always available in production;
but I don't like this option because adding functions as builtins
makes it impossible to move later to extensions.
Bruce doesn't like this option either.
Why would we want to move them later to extensions?
Because it's not nice to have random stuff as builtins.
Extensions have one nice property, they provide namespacing so not
everything has to be in pg_catalog which already has about gazilion
functions. It's nice to have stuff you don't need for day to day
operations separate but still available (which is why src/extensions is
better than contrib).
They also provide a level of control over what is and isn't installed in
a cluster. Personally, I'd prefer that most users not even be aware of
the existence of things like pageinspect.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 5 August 2015 at 18:46, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
What do others think?
Wow, everything moves when you blink, eh? Sorry I wasn't watching this.
Mainly because I was working on some other related thoughts, separate post
coming.
1. Most importantly, it needs to be somewhere where we can use the function
in a regression test. As I said before, I would not commit this without a
formal proof of correctness.
2. I'd also like to be able to make checks on this while we're in
production, to ensure we have no bugs. I was trying to learn from earlier
mistakes and make sure we are ready with diagnostic tools to allow run-time
checks and confirm everything is good. If people feel that means I've asked
for something in the wrong place, I am happy to skip that request and place
it wherever requested.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 6, 2015 at 11:33 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
They also provide a level of control over what is and isn't installed in a
cluster. Personally, I'd prefer that most users not even be aware of the
existence of things like pageinspect.
+1.
If everybody feels that moving extensions currently stored in contrib
into src/extensions is going to help us somehow, then, uh, OK. I
can't work up any enthusiasm for that, but I can live with it.
However, I think it's affirmatively bad policy to say that we're going
to put all of our debugging facilities into core because otherwise
some people might not have them installed. That's depriving users of
the ability to control their environment, and there are good reasons
for some people to want those things not to be installed. If we
accept the argument "it inconveniences hacker X when Y is not
installed" as a reason to put Y in core, then we can justify putting
anything at all into core. And I don't think that's right at all.
Extensions are a useful packaging mechanism for functionality that is
useful but not required, and debugging facilities are definitely very
useful but should not be required.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Aug 10, 2015 at 12:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Aug 6, 2015 at 11:33 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
They also provide a level of control over what is and isn't installed in a
cluster. Personally, I'd prefer that most users not even be aware of the
existence of things like pageinspect.
+1.
[...]
Extensions are a useful packaging mechanism for functionality that is
useful but not required, and debugging facilities are definitely very
useful but should not be required.
+1.
--
Michael
On Mon, Aug 10, 2015 at 11:05 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Mon, Aug 10, 2015 at 12:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Aug 6, 2015 at 11:33 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
They also provide a level of control over what is and isn't installed in a
cluster. Personally, I'd prefer that most users not even be aware of the
existence of things like pageinspect.
+1.
[...]
Extensions are a useful packaging mechanism for functionality that is
useful but not required, and debugging facilities are definitely very
useful but should not be required.
+1.
Sorry to come to the discussion late.
I have encountered many cases where pg_stat_statements and pgstattuple
are required in production, so I basically agree with moving such
extensions into core.
But IMO, the diagnostic tools for the visibility map, the heap
(pageinspect), and so on are a kind of debugging tool.
Attached are the latest v11 patches, separated into 2 patches: a frozen
bit patch and a diagnostic function patch.
Moving the diagnostic function into core is still under discussion,
but this patch puts it into core because the diagnostic function for
the visibility map needs to be in core at least so that the regression
test can use it.
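As an aside for readers skimming the attachment below: here is a minimal
standalone sketch (illustration only, not part of the patch) of the byte/bit
arithmetic the patch uses once each heap block has two bits, all-visible and
all-frozen, in the map. It mirrors the BITS_PER_HEAPBLOCK and HEAPBLK_TO_*
macros in the patched visibilitymap.c; the concrete flag values (0x01/0x02)
and the MAPSIZE value are assumptions here, not taken from the patch text.

#include <stdint.h>
#include <stdio.h>

#define MAPSIZE              8168  /* assumed: BLCKSZ minus aligned page header */
#define BITS_PER_HEAPBLOCK   2
#define HEAPBLOCKS_PER_BYTE  4
#define HEAPBLOCKS_PER_PAGE  (MAPSIZE * HEAPBLOCKS_PER_BYTE)

#define VM_ALL_VISIBLE 0x01        /* assumed flag value */
#define VM_ALL_FROZEN  0x02        /* assumed flag value */

int main(void)
{
    uint8_t  map[MAPSIZE] = {0};   /* one visibility map page worth of bits */
    uint32_t blkno = 123;          /* an arbitrary heap block number */

    /* same arithmetic as the HEAPBLK_TO_MAPBLOCK/MAPBYTE/MAPBIT macros */
    uint32_t mapBlock = blkno / HEAPBLOCKS_PER_PAGE;
    uint32_t mapByte  = (blkno % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE;
    int      mapBit   = blkno % HEAPBLOCKS_PER_BYTE;

    /* set both bits for this block, as visibilitymap_set() does:
     * flags << (BITS_PER_HEAPBLOCK * mapBit) */
    map[mapByte] |= (VM_ALL_VISIBLE | VM_ALL_FROZEN) << (BITS_PER_HEAPBLOCK * mapBit);

    /* test the all-frozen bit, as visibilitymap_test() does */
    int frozen = (map[mapByte] & (VM_ALL_FROZEN << (BITS_PER_HEAPBLOCK * mapBit))) != 0;

    printf("heap block %u -> vm page %u, byte %u, bit group %d, all-frozen=%d\n",
           blkno, mapBlock, mapByte, mapBit, frozen);
    return 0;
}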
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v11.patchtext/x-patch; charset=US-ASCII; name=000_add_frozen_bit_into_visibilitymap_v11.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3701d8e..dabd632 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2176,8 +2176,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -2192,7 +2193,11 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllFrozen(BufferGetPage(buffer));
+
visibilitymap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
vmbuffer);
@@ -2493,7 +2498,11 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+
visibilitymap_clear(relation,
BufferGetBlockNumber(buffer),
vmbuffer);
@@ -2776,9 +2785,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -2972,10 +2981,15 @@ l1:
*/
PageSetPrunable(page, xid);
+ /* clear PD_ALL_VISIBLE and PD_ALL_FROZEN flags */
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
vmbuffer);
}
@@ -3254,7 +3268,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -3850,14 +3864,22 @@ l2:
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllFrozen(BufferGetPage(buffer));
+
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
vmbuffer);
}
if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
{
all_visible_cleared_new = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(newbuf));
+ PageClearAllFrozen(BufferGetPage(newbuf));
+
visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
vmbuffer_new);
}
@@ -6942,7 +6964,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -6952,6 +6974,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7541,8 +7564,14 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
+
}
else if (action == BLK_RESTORED)
{
@@ -7593,7 +7622,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
@@ -7743,7 +7772,10 @@ heap_xlog_delete(XLogReaderState *record)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
/* Make sure there is no forward chain link in t_ctid */
htup->t_ctid = target_tid;
@@ -7847,7 +7879,10 @@ heap_xlog_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
MarkBufferDirty(buffer);
}
@@ -7986,7 +8021,10 @@ heap_xlog_multi_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
MarkBufferDirty(buffer);
}
@@ -8114,7 +8152,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(obuffer);
@@ -8249,7 +8290,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
elog(PANIC, "heap_update_redo: failed to add tuple");
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..a284b85 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,33 +21,45 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if a whole-table-scanning vacuum is required (e.g. an anti-wraparound vacuum).
+ * An all-frozen bit must be set only when the page is already all-visible.
+ * That is, the all-frozen bit is always set together with the all-visible bit.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. The all-frozen bit must be
+ * cleared together with the all-visible bit.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or frozen
+ * to all transactions; we just don't know that for certain. The difficulty is
+ * that there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit. If a
+ * crash occurs after the visibility map page makes it to disk and before the
+ * updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
+ * The visibility map is not used for anti-wraparound vacuums before 9.6, because
* an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
* present in the table, even on pages that don't have any dead tuples.
+ * In 9.6 or later, the visibility map has an additional bit which indicates
+ * that all tuples on a single page have been completely frozen, so the
+ * visibility map is also used for anti-wraparound vacuums.
+ *
*
* LOCKING
*
@@ -58,14 +70,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +113,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +130,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set bits for all-visible and all-frozen */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +174,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,7 +186,8 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) <<
+ (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
@@ -225,7 +259,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +268,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and the flags indicating which bit(s) we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +280,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,7 +290,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -272,11 +308,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +321,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +331,15 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(PageIsAllVisible(heapPage) || PageIsAllFrozen(heapPage));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +350,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test whether bit(s) are set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must pass the
+ * flags indicating which bit(s) we want to test.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -328,7 +369,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +378,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +401,12 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single or double bit read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ?
+ true : false;
return result;
}
@@ -374,10 +416,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * The caller must pass the flags indicating which bit(s) we want to count.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, uint8 flags)
{
BlockNumber result = 0;
BlockNumber mapBlock;
@@ -406,7 +449,10 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ if (flags & VISIBILITYMAP_ALL_VISIBLE)
+ result += number_of_ones_for_visible[map[i]];
+ if (flags & VISIBILITYMAP_ALL_FROZEN)
+ result += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
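(Aside, not part of the patch: the two lookup tables added above can be
cross-checked with a few lines of C. This sketch assumes the all-visible bit
of each heap block sits at an even bit position of the map byte and the
all-frozen bit at the following odd position, which is what the
flags << (BITS_PER_HEAPBLOCK * mapBit) arithmetic implies.)

#include <stdint.h>
#include <stdio.h>

/* count the set bits in one byte */
static int popcount8(uint8_t b)
{
    int n = 0;
    for (; b != 0; b >>= 1)
        n += b & 1;
    return n;
}

int main(void)
{
    /* 0x55 masks bit positions 0,2,4,6 (assumed all-visible bits);
     * 0xAA masks bit positions 1,3,5,7 (assumed all-frozen bits). */
    for (int i = 0; i < 256; i++)
        printf("%3d: visible=%d frozen=%d\n", i,
               popcount8((uint8_t) (i & 0x55)),
               popcount8((uint8_t) (i & 0xAA)));
    return 0;
}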
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..10f8dc9 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ {
+ relallvisible = visibilitymap_count(rel, VISIBILITYMAP_ALL_VISIBLE);
+ relallfrozen = visibilitymap_count(rel, VISIBILITYMAP_ALL_FROZEN);
+ }
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
if (rd_rel->relpages != (int32) relpages)
{
@@ -1940,6 +1947,11 @@ index_update_stats(Relation rel,
rd_rel->relallvisible = (int32) relallvisible;
dirty = true;
}
+ if (rd_rel->relallfrozen != (int32) relallfrozen)
+ {
+ rd_rel->relallfrozen = (int32) relallfrozen;
+ dirty = true;
+ }
}
/*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 861048f..392c2a4 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -572,7 +572,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN),
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -595,6 +596,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ab4874..d3725dd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -22,6 +22,7 @@
#include "access/rewriteheap.h"
#include "access/transam.h"
#include "access/tuptoaster.h"
+#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 85b0483..744bfff 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -747,6 +747,7 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
bool in_outer_xact)
@@ -784,6 +785,11 @@ vac_update_relstats(Relation relation,
pgcform->relallvisible = (int32) num_all_visible_pages;
dirty = true;
}
+ if (pgcform->relallfrozen != (int32) num_all_frozen_pages)
+ {
+ pgcform->relallfrozen = (int32) num_all_frozen_pages;
+ dirty = true;
+ }
/* Apply DDL updates, but not inside an outer transaction (see above) */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..120de63 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped by all-frozen bit
+ of visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we can skip scanning some pages
+ * according to the all-frozen bit of the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +258,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -301,10 +307,14 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ new_rel_allfrozen = visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN);
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -312,6 +322,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_pages,
new_rel_tuples,
new_rel_allvisible,
+ new_rel_allfrozen,
vacrelstats->hasindex,
new_frozen_xid,
new_min_multi,
@@ -360,10 +371,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +498,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * the visibility map means that we can't update relfrozenxid, so we only
+ * want to do it if we can skip a goodly number. On the other hand, we
+ * count both how many pages we skipped according to the all-frozen bit of
+ * the visibility map and how many pages we froze, so we can update
+ * relfrozenxid if the sum of them equals the number of pages in the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -515,7 +530,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -533,7 +549,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples we freeze on this page */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page; /* # of tuples on this page */
Size freespace;
bool all_visible_according_to_vm;
bool all_visible;
@@ -548,7 +567,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -566,9 +586,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We check whether this block is also all-frozen, so that we can skip
+ * vacuuming it even when a whole-table scan is required.
+ */
+ if (scan_all)
+ {
+ if (visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ }
+ else
+ {
+ if (skipping_all_visible_blocks)
+ continue;
+ }
+
all_visible_according_to_vm = true;
}
@@ -740,7 +776,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +801,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +957,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +975,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
+ * We must log the changes to be crash-safe against future truncation
+ * of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1011,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1036,47 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen in total */
+ if ((ntotal_frozen == ntup_per_page) &&
+ !visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,7 +1087,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
@@ -1047,6 +1116,17 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
visibilitymap_clear(onerel, blkno, vmbuffer);
}
+ /*
+ * As a result of scanning the page, set the VM all-frozen bit and the
+ * page-header flag if all tuples on the page are frozen.
+ */
+ if (ntotal_frozen == ntup_per_page)
+ {
+ PageSetAllFrozen(page);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer,
+ InvalidTransactionId, VISIBILITYMAP_ALL_FROZEN);
+ }
+
UnlockReleaseBuffer(buf);
/* Remember the location of the last page with nonremovable tuples */
@@ -1078,7 +1158,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map and frozen map page.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1115,6 +1195,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
tups_vacuumed, vacuumed_pages)));
/*
+ * This information shows how effective the all-frozen bit of the VM was
+ * at letting vacuum skip pages when freezing tuples.
+ */
+ ereport(elevel,
+ (errmsg("Skipped %u frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages)));
+
+ /*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
*/
@@ -1226,6 +1314,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1277,19 +1366,31 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set VM all-frozen bit and flag.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
- Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ if (!visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* mark page all-frozen, and set VM all-frozen bit */
+ if (all_frozen)
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1408,6 +1509,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_pages,
stats->num_index_tuples,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
@@ -1782,7 +1884,8 @@ vac_cmp_itemptr(const void *left, const void *right)
* xmin amongst the visible tuples.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1791,6 +1894,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1814,7 +1918,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
@@ -1855,6 +1959,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1863,6 +1971,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1ef76d0..ee49ddf 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -127,7 +127,7 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList)
if (attno != resultDesc->natts)
ereport(ERROR,
(errcode(ERRCODE_DATATYPE_MISMATCH),
- errmsg("table row type and query-specified row type do not match"),
+ errmsg("table row type and query-specified row type do not match"),
errdetail("Query has too few columns.")));
}
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 79d9390..8fededc 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
#include <fcntl.h>
@@ -21,6 +22,27 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static int rewrite_vm_to_vfm(const char *fromfile, const char *tofile, bool force);
+
+/* table for fast rewriting vm file to vfm file */
+static const uint16 rewrite_vm_to_vfm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
/*
* copyAndUpdateFile()
@@ -30,11 +52,19 @@ static int win32_pghardlink(const char *src, const char *dst);
*/
const char *
copyAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst, bool force)
+ const char *src, const char *dst, bool force, bool rewrite_vm)
{
+
if (pageConverter == NULL)
{
- if (pg_copy_file(src, dst, force) == -1)
+ int ret;
+
+ if (rewrite_vm)
+ ret = rewrite_vm_to_vfm(src, dst, force);
+ else
+ ret = pg_copy_file(src, dst, force);
+
+ if (ret)
return getErrorText(errno);
else
return NULL;
@@ -99,7 +129,6 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
}
}
-
/*
* linkAndUpdateFile()
*
@@ -201,6 +230,110 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewrite_vm_to_vfm()
+ *
+ * An additional bit indicating that all tuples on the page are completely
+ * frozen was added to the visibility map in PG 9.6, so the format of the
+ * visibility map has changed.
+ * Copies a visibility map file while adding an all-frozen bit (initially 0) for each existing bit.
+ */
+static int
+rewrite_vm_to_vfm(const char *fromfile, const char *tofile, bool force)
+{
+#define REWRITE_BUF_SIZE (50 * BLCKSZ)
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd, dst_fd;
+ uint16 vfm_bits;
+ ssize_t nbytes;
+ char *buffer;
+ int ret = 0;
+ int save_errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ {
+ errno = EINVAL;
+ return -1;
+ }
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ return -1;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ {
+ save_errno = errno;
+ if (src_fd != 0)
+ close(src_fd);
+
+ errno = save_errno;
+ return -1;
+ }
+
+ buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
+
+ /* Copy page header data in advance */
+ if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
+ {
+ save_errno = errno;
+ return -1;
+ }
+
+ if (write(dst_fd, buffer, nbytes) != nbytes)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ save_errno = errno;
+ return -1;
+ }
+
+ /* perform data rewriting, i.e., read from source, write to destination */
+ while (true)
+ {
+ ssize_t nbytes = read(src_fd, buffer, REWRITE_BUF_SIZE);
+ char *cur, *end;
+
+ if (nbytes < 0)
+ {
+ ret = -1;
+ break;
+ }
+
+ if (nbytes == 0)
+ break;
+
+ cur = buffer;
+ end = buffer + nbytes;
+
+ /*
+ * Rewrite each source byte and write BITS_PER_HEAPBLOCK bytes to dst_fd.
+ */
+ while (end > cur)
+ {
+ /* Look up the rewritten bit pattern for this source byte from the table */
+ vfm_bits = rewrite_vm_to_vfm_table[(uint8) *cur];
+
+ if (write(dst_fd, &vfm_bits, BITS_PER_HEAPBLOCK) != BITS_PER_HEAPBLOCK)
+ {
+ ret = -1;
+ break;
+ }
+ cur++;
+ }
+ }
+
+ pg_free(buffer);
+
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return ret;
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 13aa891..090422d 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -112,6 +112,11 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of visibility map changed with this 9.6 commit,
+ *
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201508181
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -397,7 +402,7 @@ typedef void *pageCnvCtx;
#endif
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst, bool force);
+ const char *dst, bool force, bool rewrite_vm);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c22df42..766a473 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *type_old_suffix, const char *type_new_suffix);
/*
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_rewrite_needed = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,13 +181,20 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite the "vm" file to a "vfm" file?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_rewrite_needed = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(pageConverter, &maps[mapnum], "", "");
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +202,17 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(pageConverter, &maps[mapnum], "_fsm", "_fsm");
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ {
+ /*
+ * vm file is changed to vfm file in PG 9.6.
+ */
+ if (vm_rewrite_needed)
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", "_vfm");
+ else
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", "_vm");
+ }
}
}
}
@@ -210,7 +226,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
*/
static void
transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+ const char *type_old_suffix, const char *type_new_suffix)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -218,6 +234,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
int fd;
int segno;
char extent_suffix[65];
+ bool rewrite_vm = false;
/*
* Now copy/link any related segments as well. Remember, PG breaks large
@@ -236,18 +253,18 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
map->old_tablespace_suffix,
map->old_db_oid,
map->old_relfilenode,
- type_suffix,
+ type_old_suffix,
extent_suffix);
snprintf(new_file, sizeof(new_file), "%s%s/%u/%u%s%s",
map->new_tablespace,
map->new_tablespace_suffix,
map->new_db_oid,
map->new_relfilenode,
- type_suffix,
+ type_new_suffix,
extent_suffix);
/* Is it an extent, fsm, or vm file? */
- if (type_suffix[0] != '\0' || segno != 0)
+ if (type_old_suffix[0] != '\0' || segno != 0)
{
/* Did file open fail? */
if ((fd = open(old_file, O_RDONLY, 0)) == -1)
@@ -276,7 +293,11 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /* If the suffix changed, we need to rewrite the vm file into a vfm file. */
+ if (strcmp(type_old_suffix, type_new_suffix) != 0)
+ rewrite_vm = true;
+
+ if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true, rewrite_vm)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index ec3a7ed..508757e 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -170,6 +170,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -184,6 +189,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -199,6 +212,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -210,11 +225,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 66dfef1..5898f1b 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -30,11 +30,14 @@
* If you add a new entry, remember to update the errhint in
* forkname_to_number() below, and update the SGML documentation for
* pg_relation_size().
+ * In 9.6 or later, the visibility map fork name is changed from "vm" to
+ * "vfm" because the visibility map holds not only all-visible information
+ * but also all-frozen information.
*/
const char *const forkNames[] = {
"main", /* MAIN_FORKNUM */
"fsm", /* FSM_FORKNUM */
- "vm", /* VISIBILITYMAP_FORKNUM */
+ "vfm", /* VISIBILITYMAP_FORKNUM */
"init" /* INIT_FORKNUM */
};
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..93afb10 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..7270609 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,20 @@
#include "storage/buf.h"
#include "utils/relcache.h"
-extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
+extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern BlockNumber visibilitymap_count(Relation rel, uint8 flags);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index b58fe46..98d93c5 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201508111
+#define CATALOG_VERSION_NO 201508181
#endif
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index e526cd9..ea0f7c1 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -95,7 +97,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 30
+#define Natts_pg_class 31
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -107,25 +109,26 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relispopulated 25
-#define Anum_pg_class_relreplident 26
-#define Anum_pg_class_relfrozenxid 27
-#define Anum_pg_class_relminmxid 28
-#define Anum_pg_class_relacl 29
-#define Anum_pg_class_reloptions 30
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relispopulated 26
+#define Anum_pg_class_relreplident 27
+#define Anum_pg_class_relfrozenxid 28
+#define Anum_pg_class_relminmxid 29
+#define Anum_pg_class_relacl 30
+#define Anum_pg_class_reloptions 31
/* ----------------
* initial contents of pg_class
@@ -140,13 +143,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index ddf7c67..e320149 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3213,6 +3213,12 @@ DESCR("sleep until the specified time");
DATA(insert OID = 2971 ( text PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 25 "16" _null_ _null_ _null_ _null_ _null_ booltext _null_ _null_ _null_ ));
DESCR("convert boolean to text");
+DATA(insert OID = 3308 ( pg_is_all_visible PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 16 "2205 20" _null_ _null_ _null_ _null_ _null_ pg_is_all_visible _null_ _null_ _null_ ));
+DESCR("true if the page is all visible");
+DATA(insert OID = 3309 ( pg_is_all_frozen PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 16 "2205 20" _null_ _null_ _null_ _null_ _null_ pg_is_all_frozen _null_ _null_ _null_ ));
+DESCR("true if the page is all frozen");
+
+
/* Aggregates (moved here from pg_aggregate for 7.3) */
DATA(insert OID = 2100 ( avg PGNSP PGUID 12 1 0 0 0 t f f f f f i 1 0 1700 "20" _null_ _null_ _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e3a31af..d2bae2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -172,6 +172,7 @@ extern void vac_update_relstats(Relation relation,
BlockNumber num_pages,
double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..7bf2718 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -369,6 +371,13 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+#define PageClearAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags &= ~PD_ALL_FROZEN)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..543eeaa
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,75 @@
+--
+-- Visibility map
+--
+CREATE FUNCTION
+ pg_visibilitymap(rel regclass, blkno OUT bigint, all_visible OUT bool, all_frozen OUT bool)
+RETURNS SETOF RECORD
+AS $$
+ SELECT blkno, pg_is_all_visible($1, blkno) AS all_visible, pg_is_all_frozen($1, blkno) AS all_frozen
+ FROM generate_series(0, pg_relation_size($1) / current_setting('block_size')::bigint - 1) AS blkno;
+$$
+LANGUAGE SQL;
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+\set VERBOSITY terse
+-- All pages become all-visible
+VACUUM vmtest;
+SELECT count(all_visible) = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_visibilitymap('vmtest')
+ WHERE all_visible;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_class
+ WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+VACUUM FREEZE vmtest;
+SELECT count(all_visible) = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_visibilitymap('vmtest')
+ WHERE all_visible
+ GROUP BY all_visible;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT count(all_frozen) = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_visibilitymap('vmtest')
+ WHERE all_frozen
+ GROUP BY all_frozen;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_class
+ WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT relallfrozen = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_class
+ WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- All pages are skipped according to VM
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: Skipped 45 frozen pages acoording to visibility map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+DROP FUNCTION pg_visibilitymap(regclass);
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 4df15de..893d773 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -108,5 +108,8 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare without_oid c
# event triggers cannot run concurrently with any test that runs DDL
test: event_trigger
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
+
# run stats by itself because its delay may be insufficient under heavy load
test: stats
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 15d74d4..da84aa6 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -157,3 +157,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
\ No newline at end of file
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..11b552e
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,49 @@
+--
+-- Visibility map
+--
+
+CREATE FUNCTION
+ pg_visibilitymap(rel regclass, blkno OUT bigint, all_visible OUT bool, all_frozen OUT bool)
+RETURNS SETOF RECORD
+AS $$
+ SELECT blkno, pg_is_all_visible($1, blkno) AS all_visible, pg_is_all_frozen($1, blkno) AS all_frozen
+ FROM generate_series(0, pg_relation_size($1) / current_setting('block_size')::bigint - 1) AS blkno;
+$$
+LANGUAGE SQL;
+
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+\set VERBOSITY terse
+
+-- All pages become all-visible
+VACUUM vmtest;
+SELECT count(all_visible) = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_visibilitymap('vmtest')
+ WHERE all_visible;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_class
+ WHERE relname = 'vmtest';
+
+VACUUM FREEZE vmtest;
+SELECT count(all_visible) = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_visibilitymap('vmtest')
+ WHERE all_visible
+ GROUP BY all_visible;
+SELECT count(all_frozen) = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_visibilitymap('vmtest')
+ WHERE all_frozen
+ GROUP BY all_frozen;
+
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_class
+ WHERE relname = 'vmtest';
+SELECT relallfrozen = (pg_relation_size('vmtest') / current_setting('block_size')::int)
+ FROM pg_class
+ WHERE relname = 'vmtest';
+
+-- All pages are skipped according to VM
+VACUUM FREEZE VERBOSE vmtest;
+
+DROP FUNCTION pg_visibilitymap(regclass);
+DROP TABLE vmtest;
001_diagnostic_function_for_visibility_map_v11.patchtext/x-patch; charset=US-ASCII; name=001_diagnostic_function_for_visibility_map_v11.patchDownload
diff --git a/src/backend/access/heap/Makefile b/src/backend/access/heap/Makefile
index b83d496..806ce27 100644
--- a/src/backend/access/heap/Makefile
+++ b/src/backend/access/heap/Makefile
@@ -12,6 +12,7 @@ subdir = src/backend/access/heap
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o
+OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o \
+ heapfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/heapfuncs.c b/src/backend/access/heap/heapfuncs.c
new file mode 100644
index 0000000..6c3753b
--- /dev/null
+++ b/src/backend/access/heap/heapfuncs.c
@@ -0,0 +1,81 @@
+/*-------------------------------------------------------------------------
+ *
+ * heapfuncs.c
+ * Functions for accessing the related heap page
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/heap/heapfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/visibilitymap.h"
+#include "funcapi.h"
+#include "storage/freespace.h"
+#include "storage/bufmgr.h"
+
+/* Functions for visibilitymap */
+extern Datum pg_is_all_visible(PG_FUNCTION_ARGS);
+extern Datum pg_is_all_frozen(PG_FUNCTION_ARGS);
+
+static bool visibilitymap_test_internal(Oid relid, int64 blkno, uint8);
+
+/*
+ * Return whether the page is all-visible, according to the visibility map.
+ */
+Datum
+pg_is_all_visible(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 blkno = PG_GETARG_INT64(1);
+ bool all_visible;
+
+ all_visible = visibilitymap_test_internal(relid, blkno, VISIBILITYMAP_ALL_VISIBLE);
+
+ PG_RETURN_BOOL(all_visible);
+}
+
+/*
+ * Return whether the page is all-frozen, according to the visibility map.
+ */
+Datum
+pg_is_all_frozen(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 blkno = PG_GETARG_INT64(1);
+ bool all_frozen;
+
+ all_frozen = visibilitymap_test_internal(relid, blkno, VISIBILITYMAP_ALL_FROZEN);
+
+ PG_RETURN_BOOL(all_frozen);
+}
+
+static bool
+visibilitymap_test_internal(Oid relid, int64 blkno, uint8 flag)
+{
+
+ Relation rel;
+ Buffer vmbuffer = InvalidBuffer;
+ bool result;
+
+ rel = relation_open(relid, AccessShareLock);
+
+ if (blkno < 0 || blkno > MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid block number")));
+
+ result = visibilitymap_test(rel, blkno, &vmbuffer, flag);
+
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
+ relation_close(rel, AccessShareLock);
+
+ return result;
+}
On Tue, Aug 18, 2015 at 7:27 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I have encountered the much cases where pg_stat_statement,
pgstattuples are required in production, so I basically agree with
moving such extension into core.
But IMO, the diagnostic tools for visibility map, heap (pageinspect)
and so on, are a kind of debugging tool.
Just because something might be required in production isn't a
sufficient reason to put it in core. Debugging tools, or anything
else, can be required in production, too.
Attached latest v11 patches, which is separated into 2 patches: frozen
bit patch and diagnostic function patch.
Moving diagnostic function into core is still under the discussion,
but this patch puts such function into core because the diagnostic
function for visibility map needs to be in core to execute regression
test at least.
As has been discussed recently, there are other ways to handle that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Aug 19, 2015 at 1:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Aug 18, 2015 at 7:27 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I have encountered the much cases where pg_stat_statement,
pgstattuples are required in production, so I basically agree with
moving such extension into core.
But IMO, the diagnostic tools for visibility map, heap (pageinspect)
and so on, are a kind of debugging tool.
Just because something might be required in production isn't a
sufficient reason to put it in core. Debugging tools, or anything
else, can be required in production, too.
Attached latest v11 patches, which is separated into 2 patches: frozen
bit patch and diagnostic function patch.
Moving diagnostic function into core is still under the discussion,
but this patch puts such function into core because the diagnostic
function for visibility map needs to be in core to execute regression
test at least.
As has been discussed recently, there are other ways to handle that.
The current regression test for the VM just compares the total numbers
of all-visible and all-frozen bits in the VM before and after VACUUM,
and doesn't check any particular bit in the VM.
We could substitute it with the ANALYZE command using a large enough
sample, and check pg_class.relallvisible and pg_class.relallfrozen.
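For example, just as a sketch (using the vmtest table from the
regression test, with the patch applied so that relallfrozen exists),
that check could look like:

    ANALYZE vmtest;
    -- both counters should match the relation size once every page
    -- is all-visible and all-frozen
    SELECT relallvisible,
           relallfrozen,
           pg_relation_size('vmtest') / current_setting('block_size')::bigint AS relpages
      FROM pg_class
     WHERE relname = 'vmtest';
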
So another way is to put the diagnostic functions for the VM into a
contrib module (pg_freespacemap or pageinspect); if we want to use
such functions in production, we can install that extension as in
the past.
Regards,
--
Masahiko Sawada
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 8/19/15 2:56 AM, Masahiko Sawada wrote:
The current regression test for the VM just compares the total numbers
of all-visible and all-frozen bits in the VM before and after VACUUM,
and doesn't check any particular bit in the VM.
We could substitute it with the ANALYZE command using a large enough
sample, and check pg_class.relallvisible and pg_class.relallfrozen.
I think this is another indication that we need more than just pg_regress...
So another way is to put the diagnostic functions for the VM into a
contrib module (pg_freespacemap or pageinspect); if we want to use
such functions in production, we can install that extension as in
the past.
pg_buffercache is very useful as a performance monitoring tool, and I
view being able to pull statistics about the VM and FM the same way. I'd
like to see us providing more performance information by default, not less.
I think things like pageinspect are very different; I really can't see
any use for those beyond debugging (and debugging by an expert at that).
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Jim Nasby wrote:
I think things like pageinspect are very different; I really can't see any
use for those beyond debugging (and debugging by an expert at that).
I don't think that necessarily means it must continue to be in contrib.
Quite the contrary, I think it is a tool critical enough that it should
not be relegated to be a second-class citizen as it is now (let's face
it, being in contrib *is* second-class citizenship).
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Jim Nasby wrote:
I think things like pageinspect are very different; I really can't see any
use for those beyond debugging (and debugging by an expert at that).
I don't think that necessarily means it must continue to be in contrib.
Quite the contrary, I think it is a tool critical enough that it should
not be relegated to be a second-class citizen as it is now (let's face
it, being in contrib *is* second-class citizenship).
Attached is the latest patch.
The VM regression test has been changed so that the test runs
without the diagnostic functions.
In the current patch, we VACUUM and then VACUUM FREEZE the table, and
check the resulting values of pg_class.relallvisible and relallfrozen.
When the first VACUUM runs in the regression test, the table has no VM
yet, so VACUUM scans all pages and records exact information about the
number of all-visible bits.
When the second VACUUM FREEZE runs, it also scans all pages, because
no page is marked all-frozen yet, so VACUUM FREEZE records exact
information about the number of all-frozen bits.
In the previous patch, we checked the VM bits one by one using the
diagnostic functions and compared those results with
pg_class.relallvisible(/relallfrozen).
So the essential check is the same as in the previous patch,
and we can still ensure correctness with this procedure.
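In other words, the test boils down to roughly the following sequence
(a simplified sketch of the procedure described above, not the exact
regression script):

    CREATE TABLE vmtest (i int PRIMARY KEY);
    INSERT INTO vmtest SELECT generate_series(1, 10000);

    -- first pass: the table has no VM yet, so every page is scanned
    -- and relallvisible is exact
    VACUUM vmtest;
    SELECT relallvisible = pg_relation_size('vmtest') / current_setting('block_size')::bigint
      FROM pg_class WHERE relname = 'vmtest';

    -- second pass: no page is marked all-frozen yet, so every page is
    -- scanned again and relallfrozen is exact
    VACUUM FREEZE vmtest;
    SELECT relallfrozen = pg_relation_size('vmtest') / current_setting('block_size')::bigint
      FROM pg_class WHERE relname = 'vmtest';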
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v12.patchapplication/octet-stream; name=000_add_frozen_bit_into_visibilitymap_v12.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3701d8e..dabd632 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2176,8 +2176,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map
+ * page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -2192,7 +2193,11 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllFrozen(BufferGetPage(buffer));
+
visibilitymap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
vmbuffer);
@@ -2493,7 +2498,11 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+
visibilitymap_clear(relation,
BufferGetBlockNumber(buffer),
vmbuffer);
@@ -2776,9 +2785,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -2972,10 +2981,15 @@ l1:
*/
PageSetPrunable(page, xid);
+ /* clear PD_ALL_VISIBLE and PD_ALL_FROZEN flags */
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
vmbuffer);
}
@@ -3254,7 +3268,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -3850,14 +3864,22 @@ l2:
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllFrozen(BufferGetPage(buffer));
+
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
vmbuffer);
}
if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
{
all_visible_cleared_new = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(newbuf));
+ PageClearAllFrozen(BufferGetPage(newbuf));
+
visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
vmbuffer_new);
}
@@ -6942,7 +6964,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -6952,6 +6974,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7541,8 +7564,14 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
+
}
else if (action == BLK_RESTORED)
{
@@ -7593,7 +7622,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
@@ -7743,7 +7772,10 @@ heap_xlog_delete(XLogReaderState *record)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
/* Make sure there is no forward chain link in t_ctid */
htup->t_ctid = target_tid;
@@ -7847,7 +7879,10 @@ heap_xlog_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
MarkBufferDirty(buffer);
}
@@ -7986,7 +8021,10 @@ heap_xlog_multi_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
MarkBufferDirty(buffer);
}
@@ -8114,7 +8152,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(obuffer);
@@ -8249,7 +8290,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
elog(PANIC, "heap_update_redo: failed to add tuple");
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..a284b85 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,33 +21,45 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if whole table scanning vacuum is required (e.g. anti-wraparound vacuum).
+ * An all-frozen bit must be set only when the page is already all-visible.
+ * That is, the all-frozen bit is always set together with the all-visible bit.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. The all-frozen bit must be
+ * cleared together with the all-visible bit at the same time.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or frozen
+ * to all transactions; we just don't know that for certain. The difficulty is
+ * that there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit. If a
+ * crash occurs after the visibility map page makes it to disk and before the
+ * updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
+ * The visibility map is not used for anti-wraparound vacuums before 9.6, because
* an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
* present in the table, even on pages that don't have any dead tuples.
+ * In 9.6 or later, the visibility map has an additional bit which indicates that
+ * all tuples on a single page are completely frozen, so the visibility map is also
+ * used for anti-wraparound vacuums.
+ *
*
* LOCKING
*
@@ -58,14 +70,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +113,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +130,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set bits for visible and freeze */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +174,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,7 +186,8 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) <<
+ (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
@@ -225,7 +259,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +268,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and the flags indicating which bits we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +280,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,7 +290,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -272,11 +308,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +321,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +331,15 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(PageIsAllVisible(heapPage) || PageIsAllFrozen(heapPage));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +350,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test if bit(s) is set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must pass
+ * flags indicating which bit we want to test.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -328,7 +369,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +378,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +401,12 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single or double bit read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ?
+ true : false;
return result;
}
@@ -374,10 +416,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * The caller must pass the flags indicating which bit(s) to count.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, uint8 flags)
{
BlockNumber result = 0;
BlockNumber mapBlock;
@@ -406,7 +449,10 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ if (flags & VISIBILITYMAP_ALL_VISIBLE)
+ result += number_of_ones_for_visible[map[i]];
+ if (flags & VISIBILITYMAP_ALL_FROZEN)
+ result += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..10f8dc9 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ {
+ relallvisible = visibilitymap_count(rel, VISIBILITYMAP_ALL_VISIBLE);
+ relallfrozen = visibilitymap_count(rel, VISIBILITYMAP_ALL_FROZEN);
+ }
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
if (rd_rel->relpages != (int32) relpages)
{
@@ -1940,6 +1947,11 @@ index_update_stats(Relation rel,
rd_rel->relallvisible = (int32) relallvisible;
dirty = true;
}
+ if (rd_rel->relallfrozen != (int32) relallfrozen)
+ {
+ rd_rel->relallfrozen = (int32) relallfrozen;
+ dirty = true;
+ }
}
/*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 861048f..392c2a4 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -572,7 +572,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN),
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -595,6 +596,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ab4874..d3725dd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -22,6 +22,7 @@
#include "access/rewriteheap.h"
#include "access/transam.h"
#include "access/tuptoaster.h"
+#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 85b0483..744bfff 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -747,6 +747,7 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
bool in_outer_xact)
@@ -784,6 +785,11 @@ vac_update_relstats(Relation relation,
pgcform->relallvisible = (int32) num_all_visible_pages;
dirty = true;
}
+ if (pgcform->relallfrozen != (int32) num_all_frozen_pages)
+ {
+ pgcform->relallfrozen = (int32) num_all_frozen_pages;
+ dirty = true;
+ }
/* Apply DDL updates, but not inside an outer transaction (see above) */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..120de63 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped according to the
+ all-frozen bit of the visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we can still skip scanning pages whose
+ * all-frozen bit is set in the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +258,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -301,10 +307,14 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ new_rel_allfrozen = visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN);
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -312,6 +322,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_pages,
new_rel_tuples,
new_rel_allvisible,
+ new_rel_allfrozen,
vacrelstats->hasindex,
new_frozen_xid,
new_min_multi,
@@ -360,10 +371,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +498,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of the
+ * visibility map means that we can't update relfrozenxid, so we only want
+ * to do it if we can skip a goodly number. On the other hand, we count
+ * both the pages we skip according to the all-frozen bit and the pages we
+ * actually freeze, so we can still update relfrozenxid when their sum
+ * covers every page of the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -515,7 +530,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -533,7 +549,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples we froze on this page */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
bool all_visible;
@@ -548,7 +567,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -566,9 +586,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We also check whether it is all-frozen, so that we can skip vacuuming
+ * it even when a whole-table scan is required.
+ */
+ if (scan_all)
+ {
+ if (visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ }
+ else
+ {
+ if (skipping_all_visible_blocks)
+ continue;
+ }
+
all_visible_according_to_vm = true;
}
@@ -740,7 +776,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +801,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +957,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +975,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
+ * We must log the changes to be crash-safe against future truncation
+ * of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1011,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1036,47 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen in total */
+ if ((ntotal_frozen == ntup_per_page) &&
+ !visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,7 +1087,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
@@ -1047,6 +1116,17 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
visibilitymap_clear(onerel, blkno, vmbuffer);
}
+ /*
+ * Having scanned the page, set the all-frozen bit in the VM and the
+ * page-header flag if every tuple on the page is frozen.
+ */
+ if (ntotal_frozen == ntup_per_page)
+ {
+ PageSetAllFrozen(page);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer,
+ InvalidTransactionId, VISIBILITYMAP_ALL_FROZEN);
+ }
+
UnlockReleaseBuffer(buf);
/* Remember the location of the last page with nonremovable tuples */
@@ -1078,7 +1158,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map and frozen map page.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1115,6 +1195,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
tups_vacuumed, vacuumed_pages)));
/*
+ * This information shows how effective the all-frozen bit of the VM was
+ * at letting us skip freezing work.
+ */
+ ereport(elevel,
+ (errmsg("Skipped %d frozen pages acoording to visibility map",
+ vacrelstats->vmskipped_frozen_pages)));
+
+ /*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
*/
@@ -1226,6 +1314,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1277,19 +1366,31 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set VM all-frozen bit and flag.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
- Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ if (!visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* mark page all-frozen, and set VM all-frozen bit */
+ if (all_frozen)
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1408,6 +1509,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_pages,
stats->num_index_tuples,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
@@ -1782,7 +1884,8 @@ vac_cmp_itemptr(const void *left, const void *right)
* xmin amongst the visible tuples.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1791,6 +1894,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1814,7 +1918,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible or frozen.
*/
if (ItemIdIsDead(itemid))
{
@@ -1855,6 +1959,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1863,6 +1971,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1ef76d0..ee49ddf 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -127,7 +127,7 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList)
if (attno != resultDesc->natts)
ereport(ERROR,
(errcode(ERRCODE_DATATYPE_MISMATCH),
- errmsg("table row type and query-specified row type do not match"),
+ errmsg("table row type and query-specified row type do not match"),
errdetail("Query has too few columns.")));
}
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 79d9390..8fededc 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
#include <fcntl.h>
@@ -21,6 +22,27 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static int rewrite_vm_to_vfm(const char *fromfile, const char *tofile, bool force);
+
+/* table for fast rewriting vm file to vfm file */
+static const uint16 rewrite_vm_to_vfm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
/*
* copyAndUpdateFile()
@@ -30,11 +52,19 @@ static int win32_pghardlink(const char *src, const char *dst);
*/
const char *
copyAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst, bool force)
+ const char *src, const char *dst, bool force, bool rewrite_vm)
{
+
if (pageConverter == NULL)
{
- if (pg_copy_file(src, dst, force) == -1)
+ int ret;
+
+ if (rewrite_vm)
+ ret = rewrite_vm_to_vfm(src, dst, force);
+ else
+ ret = pg_copy_file(src, dst, force);
+
+ if (ret)
return getErrorText(errno);
else
return NULL;
@@ -99,7 +129,6 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
}
}
-
/*
* linkAndUpdateFile()
*
@@ -201,6 +230,110 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewrite_vm_to_vfm()
+ *
+ * In PG 9.6 an additional bit, indicating that all tuples on the page are
+ * completely frozen, was added to the visibility map, so the format of the
+ * visibility map has changed.
+ * Copies a visibility map file while adding a cleared all-frozen bit for each block.
+ */
+static int
+rewrite_vm_to_vfm(const char *fromfile, const char *tofile, bool force)
+{
+#define REWRITE_BUF_SIZE (50 * BLCKSZ)
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd, dst_fd;
+ uint16 vfm_bits;
+ ssize_t nbytes;
+ char *buffer;
+ int ret = 0;
+ int save_errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ {
+ errno = EINVAL;
+ return -1;
+ }
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ return -1;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ {
+ save_errno = errno;
+ if (src_fd != 0)
+ close(src_fd);
+
+ errno = save_errno;
+ return -1;
+ }
+
+ buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
+
+ /* Copy page header data in advance */
+ if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
+ {
+ save_errno = errno;
+ return -1;
+ }
+
+ if (write(dst_fd, buffer, nbytes) != nbytes)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ save_errno = errno;
+ return -1;
+ }
+
+ /* perform the data rewriting, i.e. read from source, write to destination */
+ while (true)
+ {
+ ssize_t nbytes = read(src_fd, buffer, REWRITE_BUF_SIZE);
+ char *cur, *end;
+
+ if (nbytes < 0)
+ {
+ ret = -1;
+ break;
+ }
+
+ if (nbytes == 0)
+ break;
+
+ cur = buffer;
+ end = buffer + nbytes;
+
+ /*
+ * Rewrite each source byte into BITS_PER_HEAPBLOCK bytes and write them to dst_fd.
+ */
+ while (end > cur)
+ {
+ /* Look up the rewritten two-bit-per-block value in the table */
+ vfm_bits = rewrite_vm_to_vfm_table[(uint8) *cur];
+
+ if (write(dst_fd, &vfm_bits, BITS_PER_HEAPBLOCK) != BITS_PER_HEAPBLOCK)
+ {
+ ret = -1;
+ break;
+ }
+ cur++;
+ }
+ }
+
+ pg_free(buffer);
+
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return ret;
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 13aa891..090422d 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -112,6 +112,11 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of visibility map changed with this 9.6 commit,
+ *
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201508181
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -397,7 +402,7 @@ typedef void *pageCnvCtx;
#endif
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst, bool force);
+ const char *dst, bool force, bool rewrite_vm);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c22df42..766a473 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *type_old_suffix, const char *type_new_suffix);
/*
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_rewrite_needed = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,13 +181,20 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite "vm" to "vfm"?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_rewrite_needed = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(pageConverter, &maps[mapnum], "", "");
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +202,17 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(pageConverter, &maps[mapnum], "_fsm", "_fsm");
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ {
+ /*
+ * The vm file is renamed to the vfm file in PG 9.6.
+ */
+ if (vm_rewrite_needed)
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", "_vfm");
+ else
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", "_vm");
+ }
}
}
}
@@ -210,7 +226,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
*/
static void
transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+ const char *type_old_suffix, const char *type_new_suffix)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -218,6 +234,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
int fd;
int segno;
char extent_suffix[65];
+ bool rewrite_vm = false;
/*
* Now copy/link any related segments as well. Remember, PG breaks large
@@ -236,18 +253,18 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
map->old_tablespace_suffix,
map->old_db_oid,
map->old_relfilenode,
- type_suffix,
+ type_old_suffix,
extent_suffix);
snprintf(new_file, sizeof(new_file), "%s%s/%u/%u%s%s",
map->new_tablespace,
map->new_tablespace_suffix,
map->new_db_oid,
map->new_relfilenode,
- type_suffix,
+ type_new_suffix,
extent_suffix);
/* Is it an extent, fsm, or vm file? */
- if (type_suffix[0] != '\0' || segno != 0)
+ if (type_old_suffix[0] != '\0' || segno != 0)
{
/* Did file open fail? */
if ((fd = open(old_file, O_RDONLY, 0)) == -1)
@@ -276,7 +293,11 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /* We need to rewrite vm file to vfm file. */
+ if (strcmp(type_old_suffix, type_new_suffix) != 0)
+ rewrite_vm = true;
+
+ if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true, rewrite_vm)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index ec3a7ed..508757e 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -170,6 +170,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -184,6 +189,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -199,6 +212,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -210,11 +225,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 66dfef1..5898f1b 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -30,11 +30,14 @@
* If you add a new entry, remember to update the errhint in
* forkname_to_number() below, and update the SGML documentation for
* pg_relation_size().
+ * In 9.6 and later, the visibility map fork name is changed from "vm" to
+ * "vfm" because the visibility map carries not only all-visible
+ * information but also all-frozen information.
*/
const char *const forkNames[] = {
"main", /* MAIN_FORKNUM */
"fsm", /* FSM_FORKNUM */
- "vm", /* VISIBILITYMAP_FORKNUM */
+ "vfm", /* VISIBILITYMAP_FORKNUM */
"init" /* INIT_FORKNUM */
};
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..93afb10 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..7270609 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,20 @@
#include "storage/buf.h"
#include "utils/relcache.h"
-extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
+extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern BlockNumber visibilitymap_count(Relation rel, uint8 flags);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index b58fe46..98d93c5 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201508111
+#define CATALOG_VERSION_NO 201508181
#endif
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index e526cd9..ea0f7c1 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -95,7 +97,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 30
+#define Natts_pg_class 31
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -107,25 +109,26 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relispopulated 25
-#define Anum_pg_class_relreplident 26
-#define Anum_pg_class_relfrozenxid 27
-#define Anum_pg_class_relminmxid 28
-#define Anum_pg_class_relacl 29
-#define Anum_pg_class_reloptions 30
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relispopulated 26
+#define Anum_pg_class_relreplident 27
+#define Anum_pg_class_relfrozenxid 28
+#define Anum_pg_class_relminmxid 29
+#define Anum_pg_class_relacl 30
+#define Anum_pg_class_reloptions 31
/* ----------------
* initial contents of pg_class
@@ -140,13 +143,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e3a31af..d2bae2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -172,6 +172,7 @@ extern void vac_update_relstats(Relation relation,
BlockNumber num_pages,
double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..7bf2718 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -369,6 +371,13 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+#define PageClearAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags &= ~PD_ALL_FROZEN)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 6fc5d1e..a5ff786 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -108,5 +108,8 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare without_oid c
# event triggers cannot run concurrently with any test that runs DDL
test: event_trigger
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
+
# run stats by itself because its delay may be insufficient under heavy load
test: stats
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 2ae51cf..d386d67 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -158,3 +158,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
\ No newline at end of file
On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:Jim Nasby wrote:
I think things like pageinspect are very different; I really can't see any
use for those beyond debugging (and debugging by an expert at that).I don't think that necessarily means it must continue to be in contrib.
Quite the contrary, I think it is a tool critical enough that it should
not be relegated to be a second-class citizen as it is now (let's face
it, being in contrib *is* second-class citizenship).Attached patch is latest patch.
The previous patch lacks some files for regression test.
Attached fixed v12 patch.
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v12.patchapplication/octet-stream; name=000_add_frozen_bit_into_visibilitymap_v12.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3701d8e..dabd632 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2176,8 +2176,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -2192,7 +2193,11 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllFrozen(BufferGetPage(buffer));
+
visibilitymap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
vmbuffer);
@@ -2493,7 +2498,11 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+
visibilitymap_clear(relation,
BufferGetBlockNumber(buffer),
vmbuffer);
@@ -2776,9 +2785,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -2972,10 +2981,15 @@ l1:
*/
PageSetPrunable(page, xid);
+ /* clear PD_ALL_VISIBLE and PD_ALL_FROZEN flags */
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
vmbuffer);
}
@@ -3254,7 +3268,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -3850,14 +3864,22 @@ l2:
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllFrozen(BufferGetPage(buffer));
+
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
vmbuffer);
}
if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
{
all_visible_cleared_new = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(newbuf));
+ PageClearAllFrozen(BufferGetPage(newbuf));
+
visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
vmbuffer_new);
}
@@ -6942,7 +6964,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -6952,6 +6974,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7541,8 +7564,14 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
+
}
else if (action == BLK_RESTORED)
{
@@ -7593,7 +7622,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
@@ -7743,7 +7772,10 @@ heap_xlog_delete(XLogReaderState *record)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
/* Make sure there is no forward chain link in t_ctid */
htup->t_ctid = target_tid;
@@ -7847,7 +7879,10 @@ heap_xlog_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
MarkBufferDirty(buffer);
}
@@ -7986,7 +8021,10 @@ heap_xlog_multi_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
MarkBufferDirty(buffer);
}
@@ -8114,7 +8152,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(obuffer);
@@ -8249,7 +8290,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
elog(PANIC, "heap_update_redo: failed to add tuple");
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..a284b85 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,33 +21,45 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if whole table scanning vacuum is required (e.g. anti-wraparound vacuum).
+ * A all-frozen bit must be set only when the page is already all-visible.
+ * That is, all-frozen bit is always set with all-visible bit.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. The all-frozen bit must be
+ * cleared together with the all-visible bit.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or frozen
+ * to all transactions; we just don't know that for certain. The difficulty is
+ * that there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit. If a
+ * crash occurs after the visibility map page makes it to disk and before the
+ * updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
+ * The visibility map was not used for anti-wraparound vacuums before 9.6, because
* an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
* present in the table, even on pages that don't have any dead tuples.
+ * In 9.6 and later, the visibility map has an additional bit which indicates that
+ * all tuples on the page have been completely frozen, so the visibility map is also
+ * used for anti-wraparound vacuums.
+ *
*
* LOCKING
*
@@ -58,14 +70,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +113,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +130,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set bits for the all-visible and all-frozen flags */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +174,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,7 +186,8 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) <<
+ (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
@@ -225,7 +259,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +268,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and the flags indicating which bit(s) we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +280,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,7 +290,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -272,11 +308,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != ((map[mapByte] >> (BITS_PER_HEAPBLOCK * mapBit)) & flags))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +321,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +331,15 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(PageIsAllVisible(heapPage) || PageIsAllFrozen(heapPage));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +350,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test if bit(s) are set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible to all, or all frozen, according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must pass
+ * flags indicating which bit(s) we want to test.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -328,7 +369,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +378,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +401,12 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single byte read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ?
+ true : false;
return result;
}
@@ -374,10 +416,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * The caller must pass flags indicating which bit(s) to count.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, uint8 flags)
{
BlockNumber result = 0;
BlockNumber mapBlock;
@@ -406,7 +449,10 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ if (flags & VISIBILITYMAP_ALL_VISIBLE)
+ result += number_of_ones_for_visible[map[i]];
+ if (flags & VISIBILITYMAP_ALL_FROZEN)
+ result += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
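
A note for readers following the layout change above: the following standalone program is an illustration only, not part of the patch. It assumes nothing beyond the two flag values and the constants defined in the hunks above, and it regenerates the per-byte counts encoded in number_of_ones_for_visible and number_of_ones_for_frozen, which makes the two-bits-per-heap-block layout easier to verify.

/*
 * Standalone sketch (not patch code): heap block k within a map byte owns
 * bit 2k (all-visible) and bit 2k+1 (all-frozen).  Counting those bit
 * positions for every possible byte value reproduces the lookup tables.
 */
#include <stdio.h>

#define BITS_PER_HEAPBLOCK  2
#define HEAPBLOCKS_PER_BYTE 4

int
main(void)
{
	for (int byte = 0; byte < 256; byte++)
	{
		int visible = 0;
		int frozen = 0;

		for (int blk = 0; blk < HEAPBLOCKS_PER_BYTE; blk++)
		{
			int shift = BITS_PER_HEAPBLOCK * blk;

			if (byte & (0x01 << shift))		/* VISIBILITYMAP_ALL_VISIBLE */
				visible++;
			if (byte & (0x02 << shift))		/* VISIBILITYMAP_ALL_FROZEN */
				frozen++;
		}
		printf("0x%02x: %d all-visible, %d all-frozen\n", byte, visible, frozen);
	}
	return 0;
}

Compiling and running it prints, for every possible map byte, how many blocks it marks all-visible and how many all-frozen.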
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..10f8dc9 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ {
+ relallvisible = visibilitymap_count(rel, VISIBILITYMAP_ALL_VISIBLE);
+ relallfrozen = visibilitymap_count(rel, VISIBILITYMAP_ALL_FROZEN);
+ }
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
if (rd_rel->relpages != (int32) relpages)
{
@@ -1940,6 +1947,11 @@ index_update_stats(Relation rel,
rd_rel->relallvisible = (int32) relallvisible;
dirty = true;
}
+ if (rd_rel->relallfrozen != (int32) relallfrozen)
+ {
+ rd_rel->relallfrozen = (int32) relallfrozen;
+ dirty = true;
+ }
}
/*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 861048f..392c2a4 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -572,7 +572,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN),
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -595,6 +596,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ab4874..d3725dd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -22,6 +22,7 @@
#include "access/rewriteheap.h"
#include "access/transam.h"
#include "access/tuptoaster.h"
+#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 85b0483..744bfff 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -747,6 +747,7 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
bool in_outer_xact)
@@ -784,6 +785,11 @@ vac_update_relstats(Relation relation,
pgcform->relallvisible = (int32) num_all_visible_pages;
dirty = true;
}
+ if (pgcform->relallfrozen != (int32) num_all_frozen_pages)
+ {
+ pgcform->relallfrozen = (int32) num_all_frozen_pages;
+ dirty = true;
+ }
/* Apply DDL updates, but not inside an outer transaction (see above) */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..120de63 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages skipped because their
+ * all-frozen bit was set in the VM */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we can still skip scanning pages whose
+ * all-frozen bit is set in the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +258,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -301,10 +307,14 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ new_rel_allfrozen = visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN);
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -312,6 +322,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_pages,
new_rel_tuples,
new_rel_allvisible,
+ new_rel_allfrozen,
vacrelstats->hasindex,
new_frozen_xid,
new_min_multi,
@@ -360,10 +371,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +498,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * the visibility map means that we can't update relfrozenxid, so we only
+ * want to do it if we can skip a goodly number of pages. On the other
+ * hand, we count both the pages we skip because their all-frozen bit is
+ * set and the pages we actually scan, so we can still update relfrozenxid
+ * when their sum covers the whole table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -515,7 +530,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -533,7 +549,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples we froze on this page */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
bool all_visible;
@@ -548,7 +567,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -566,9 +586,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We also check whether it is all-frozen, in which case we can skip
+ * vacuuming it even when a whole-table scan is otherwise required.
+ */
+ if (scan_all)
+ {
+ if (visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ }
+ else
+ {
+ if (skipping_all_visible_blocks)
+ continue;
+ }
+
all_visible_according_to_vm = true;
}
@@ -740,7 +776,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +801,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +957,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +975,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
+ * We must log the changes to be crash-safe against future truncation
+ * of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1011,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1036,47 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen in total */
+ if ((ntotal_frozen == ntup_per_page) &&
+ !visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,7 +1087,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
@@ -1047,6 +1116,17 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
visibilitymap_clear(onerel, blkno, vmbuffer);
}
+ /*
+ * If, after scanning the page, all of its tuples turned out to be frozen,
+ * set the page-level PD_ALL_FROZEN flag and the VM all-frozen bit.
+ */
+ if (ntotal_frozen == ntup_per_page)
+ {
+ PageSetAllFrozen(page);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer,
+ InvalidTransactionId, VISIBILITYMAP_ALL_FROZEN);
+ }
+
UnlockReleaseBuffer(buf);
/* Remember the location of the last page with nonremovable tuples */
@@ -1078,7 +1158,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map and frozen map page.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1115,6 +1195,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
tups_vacuumed, vacuumed_pages)));
/*
+ * This information shows how effective the all-frozen bit of the VM was
+ * at letting us skip pages that were already frozen.
+ */
+ ereport(elevel,
+ (errmsg("Skipped %d frozen pages acoording to visibility map",
+ vacrelstats->vmskipped_frozen_pages)));
+
+ /*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
*/
@@ -1226,6 +1314,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1277,19 +1366,31 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set the page-level flag and the VM
+ * all-frozen bit.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
- Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ if (!visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* mark page all-frozen, and set VM all-frozen bit */
+ if (all_frozen)
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1408,6 +1509,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_pages,
stats->num_index_tuples,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
@@ -1782,7 +1884,8 @@ vac_cmp_itemptr(const void *left, const void *right)
* xmin amongst the visible tuples.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1791,6 +1894,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1814,7 +1918,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible or frozen.
*/
if (ItemIdIsDead(itemid))
{
@@ -1855,6 +1959,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1863,6 +1971,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
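
To make the new skipping behaviour in lazy_scan_heap() easier to follow, here is a small self-contained simulation; it is an illustration under simplifying assumptions (the VMBits type and the block array are made up, and the 32-block threshold the real code applies to runs of all-visible pages is ignored). The point it demonstrates is that an aggressive (scan_all) vacuum may skip a block only when its all-frozen bit is set, and relfrozenxid can still be advanced as long as scanned pages plus frozen-skipped pages cover the whole relation.

#include <stdbool.h>
#include <stdio.h>

/* Per-block stand-ins for the two visibility map bits */
typedef struct
{
	bool		all_visible;
	bool		all_frozen;
} VMBits;

int
main(void)
{
	VMBits		vm[] = {
		{true, true}, {true, false}, {false, false}, {true, true}
	};
	int			nblocks = sizeof(vm) / sizeof(vm[0]);
	int			scanned = 0;
	int			skipped_frozen = 0;
	bool		scan_all = true;	/* pretend relfrozenxid is old enough */

	for (int blkno = 0; blkno < nblocks; blkno++)
	{
		if (scan_all)
		{
			/* freeze pass: only an all-frozen page has nothing left to do */
			if (vm[blkno].all_frozen)
			{
				skipped_frozen++;
				continue;
			}
		}
		else if (vm[blkno].all_visible)
			continue;			/* ordinary vacuum may skip all-visible pages */

		scanned++;				/* the real code reads and freezes the page here */
	}

	printf("scanned=%d skipped_frozen=%d can advance relfrozenxid: %s\n",
		   scanned, skipped_frozen,
		   (scanned + skipped_frozen == nblocks) ? "yes" : "no");
	return 0;
}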
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1ef76d0..ee49ddf 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -127,7 +127,7 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList)
if (attno != resultDesc->natts)
ereport(ERROR,
(errcode(ERRCODE_DATATYPE_MISMATCH),
- errmsg("table row type and query-specified row type do not match"),
+ errmsg("table row type and query-specified row type do not match"),
errdetail("Query has too few columns.")));
}
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 79d9390..ac3e0f9 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
#include <fcntl.h>
@@ -21,6 +22,27 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static int rewrite_vm_to_vfm(const char *fromfile, const char *tofile, bool force);
+
+/* table for fast rewriting vm file to vfm file */
+static const uint16 rewrite_vm_to_vfm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
/*
* copyAndUpdateFile()
@@ -30,11 +52,19 @@ static int win32_pghardlink(const char *src, const char *dst);
*/
const char *
copyAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst, bool force)
+ const char *src, const char *dst, bool force, bool rewrite_vm)
{
+
if (pageConverter == NULL)
{
- if (pg_copy_file(src, dst, force) == -1)
+ int ret;
+
+ if (rewrite_vm)
+ ret = rewrite_vm_to_vfm(src, dst, force);
+ else
+ ret = pg_copy_file(src, dst, force);
+
+ if (ret)
return getErrorText(errno);
else
return NULL;
@@ -99,7 +129,6 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
}
}
-
/*
* linkAndUpdateFile()
*
@@ -201,6 +230,110 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewrite_vm_to_vfm()
+ *
+ * In PG 9.6 an additional bit, indicating that all tuples on a page are
+ * completely frozen, was added to the visibility map, so the format of the
+ * visibility map has changed.
+ * This copies a visibility map file while inserting a cleared all-frozen bit
+ * after each existing all-visible bit.
+ */
+static int
+rewrite_vm_to_vfm(const char *fromfile, const char *tofile, bool force)
+{
+#define REWRITE_BUF_SIZE (50 * BLCKSZ)
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd, dst_fd;
+ uint16 vfm_bits;
+ ssize_t nbytes;
+ char *buffer;
+ int ret = 0;
+ int save_errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ {
+ errno = EINVAL;
+ return -1;
+ }
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ return -1;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ {
+ save_errno = errno;
+ if (src_fd != 0)
+ close(src_fd);
+
+ errno = save_errno;
+ return -1;
+ }
+
+ buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
+
+ /* Copy page header data in advance */
+ if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
+ {
+ save_errno = errno;
+ return -1;
+ }
+
+ if (write(dst_fd, buffer, nbytes) != nbytes)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ save_errno = errno;
+ return -1;
+ }
+
+ /* perform the data rewriting, i.e. read from source, write to destination */
+ while (true)
+ {
+ ssize_t nbytes = read(src_fd, buffer, REWRITE_BUF_SIZE);
+ char *cur, *end;
+
+ if (nbytes < 0)
+ {
+ ret = -1;
+ break;
+ }
+
+ if (nbytes == 0)
+ break;
+
+ cur = buffer;
+ end = buffer + nbytes;
+
+ /*
+ * Expand each source byte into BITS_PER_HEAPBLOCK bytes and write them
+ * to dst_fd.
+ */
+ while (end > cur)
+ {
+ /* Look up the expanded two-bits-per-block value for this byte */
+ vfm_bits = rewrite_vm_to_vfm_table[(uint8) *cur];
+
+ if (write(dst_fd, &vfm_bits, BITS_PER_HEAPBLOCK) != BITS_PER_HEAPBLOCK)
+ {
+ ret = -1;
+ break;
+ }
+ cur++;
+ }
+ }
+
+ pg_free(buffer);
+
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return ret;
+}
+
void
check_hard_link(void)
{
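
The rewrite_vm_to_vfm_table above is just a precomputed form of a simple bit expansion. The standalone sketch below (illustrative only; expand_vm_byte is a made-up name) derives the same values: each old all-visible bit b moves to new bit position 2*b, and the interleaved all-frozen bits start out clear.

#include <stdint.h>
#include <stdio.h>

/*
 * Expand one old-format map byte (8 blocks x 1 bit) to the new format
 * (2 bits per block), leaving every all-frozen bit zero.
 */
static uint16_t
expand_vm_byte(uint8_t old)
{
	uint16_t	newbits = 0;

	for (int bit = 0; bit < 8; bit++)
	{
		if (old & (1 << bit))
			newbits |= (uint16_t) 1 << (2 * bit);	/* all-visible */
		/* bit (2 * bit + 1), the all-frozen bit, stays clear */
	}
	return newbits;
}

int
main(void)
{
	/* Spot-check against entries of rewrite_vm_to_vfm_table above */
	printf("0x01 -> %u (table says 1)\n", expand_vm_byte(0x01));
	printf("0x0f -> %u (table says 85)\n", expand_vm_byte(0x0f));
	printf("0xff -> %u (table says 21845)\n", expand_vm_byte(0xff));
	return 0;
}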
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 13aa891..ae7ca6b 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -112,6 +112,11 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map changed with this 9.6 commit,
+ *
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201509041
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -397,7 +402,7 @@ typedef void *pageCnvCtx;
#endif
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst, bool force);
+ const char *dst, bool force, bool rewrite_vm);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c22df42..766a473 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *type_old_suffix, const char *type_new_suffix);
/*
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_rewrite_needed = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,13 +181,20 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite the "vm" fork files to "vfm"?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_rewrite_needed = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(pageConverter, &maps[mapnum], "", "");
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +202,17 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(pageConverter, &maps[mapnum], "_fsm", "_fsm");
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ {
+ /*
+ * The vm fork was renamed to vfm in PG 9.6.
+ */
+ if (vm_rewrite_needed)
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", "_vfm");
+ else
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", "_vm");
+ }
}
}
}
@@ -210,7 +226,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
*/
static void
transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+ const char *type_old_suffix, const char *type_new_suffix)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -218,6 +234,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
int fd;
int segno;
char extent_suffix[65];
+ bool rewrite_vm = false;
/*
* Now copy/link any related segments as well. Remember, PG breaks large
@@ -236,18 +253,18 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
map->old_tablespace_suffix,
map->old_db_oid,
map->old_relfilenode,
- type_suffix,
+ type_old_suffix,
extent_suffix);
snprintf(new_file, sizeof(new_file), "%s%s/%u/%u%s%s",
map->new_tablespace,
map->new_tablespace_suffix,
map->new_db_oid,
map->new_relfilenode,
- type_suffix,
+ type_new_suffix,
extent_suffix);
/* Is it an extent, fsm, or vm file? */
- if (type_suffix[0] != '\0' || segno != 0)
+ if (type_old_suffix[0] != '\0' || segno != 0)
{
/* Did file open fail? */
if ((fd = open(old_file, O_RDONLY, 0)) == -1)
@@ -276,7 +293,11 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /* We need to rewrite vm file to vfm file. */
+ if (strcmp(type_old_suffix, type_new_suffix) != 0)
+ rewrite_vm = true;
+
+ if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true, rewrite_vm)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index ec3a7ed..508757e 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -170,6 +170,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -184,6 +189,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -199,6 +212,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -210,11 +225,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 66dfef1..5898f1b 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -30,11 +30,14 @@
* If you add a new entry, remember to update the errhint in
* forkname_to_number() below, and update the SGML documentation for
* pg_relation_size().
+ * In 9.6 or later, the visibility map fork name is changed from "vm" to
+ * "vfm" because the visibility map now carries not only all-visible but
+ * also all-frozen information.
*/
const char *const forkNames[] = {
"main", /* MAIN_FORKNUM */
"fsm", /* FSM_FORKNUM */
- "vm", /* VISIBILITYMAP_FORKNUM */
+ "vfm", /* VISIBILITYMAP_FORKNUM */
"init" /* INIT_FORKNUM */
};
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..93afb10 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..7270609 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,20 @@
#include "storage/buf.h"
#include "utils/relcache.h"
-extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
+extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern BlockNumber visibilitymap_count(Relation rel, uint8 flags);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
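
As a quick illustration of the flag-based interface declared above, here is a toy in-memory model; it is not backend code, and the vm_set/vm_test helpers and the fixed-size array are made up, but they use the same shift arithmetic as visibilitymap_set() and visibilitymap_test() in the patch.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define VISIBILITYMAP_ALL_VISIBLE	0x01
#define VISIBILITYMAP_ALL_FROZEN	0x02
#define BITS_PER_HEAPBLOCK			2
#define HEAPBLOCKS_PER_BYTE			4

static uint8_t map[1024];		/* stand-in for a single map page */

static void
vm_set(uint32_t heapBlk, uint8_t flags)
{
	uint32_t	mapByte = heapBlk / HEAPBLOCKS_PER_BYTE;
	uint32_t	mapBit = heapBlk % HEAPBLOCKS_PER_BYTE;

	map[mapByte] |= (uint8_t) (flags << (BITS_PER_HEAPBLOCK * mapBit));
}

static bool
vm_test(uint32_t heapBlk, uint8_t flags)
{
	uint32_t	mapByte = heapBlk / HEAPBLOCKS_PER_BYTE;
	uint32_t	mapBit = heapBlk % HEAPBLOCKS_PER_BYTE;

	/* true if any of the requested bits is set, as in visibilitymap_test() */
	return (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) != 0;
}

int
main(void)
{
	memset(map, 0, sizeof(map));

	vm_set(7, VISIBILITYMAP_ALL_VISIBLE);
	vm_set(7, VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);

	printf("block 7 all-visible: %d\n", vm_test(7, VISIBILITYMAP_ALL_VISIBLE));
	printf("block 7 all-frozen:  %d\n", vm_test(7, VISIBILITYMAP_ALL_FROZEN));
	printf("block 8 either bit:  %d\n",
		   vm_test(8, VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN));
	return 0;
}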
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index b58fe46..c03ebbb 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201508111
+#define CATALOG_VERSION_NO 201509041
#endif
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index e526cd9..ea0f7c1 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ * up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -95,7 +97,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 30
+#define Natts_pg_class 31
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -107,25 +109,26 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relispopulated 25
-#define Anum_pg_class_relreplident 26
-#define Anum_pg_class_relfrozenxid 27
-#define Anum_pg_class_relminmxid 28
-#define Anum_pg_class_relacl 29
-#define Anum_pg_class_reloptions 30
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relispopulated 26
+#define Anum_pg_class_relreplident 27
+#define Anum_pg_class_relfrozenxid 28
+#define Anum_pg_class_relminmxid 29
+#define Anum_pg_class_relacl 30
+#define Anum_pg_class_reloptions 31
/* ----------------
* initial contents of pg_class
@@ -140,13 +143,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 28 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e3a31af..d2bae2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -172,6 +172,7 @@ extern void vac_update_relstats(Relation relation,
BlockNumber num_pages,
double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..7bf2718 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ * frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -369,6 +371,13 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+#define PageClearAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags &= ~PD_ALL_FROZEN)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..a410553
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,29 @@
+--
+-- Visibility map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+\set VERBOSITY terse
+-- All pages become all-visible
+VACUUM vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- All pages become all-frozen
+VACUUM FREEZE vmtest;
+SELECT relallfrozen = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- All pages are skipped according to VM
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: Skipped 45 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 6fc5d1e..a5ff786 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -108,5 +108,8 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare without_oid c
# event triggers cannot run concurrently with any test that runs DDL
test: event_trigger
+# the visibility map and vacuum test cannot run concurrently with other tests, because concurrent transactions can keep pages from becoming all-visible
+test: visibilitymap
+
# run stats by itself because its delay may be insufficient under heavy load
test: stats
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 2ae51cf..d386d67 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -158,3 +158,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
\ No newline at end of file
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..9bf9094
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,20 @@
+--
+-- Visibility map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+\set VERBOSITY terse
+
+-- All pages become all-visible
+VACUUM vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- All pages become all-frozen
+VACUUM FREEZE vmtest;
+SELECT relallfrozen = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- All pages are skipped according to VM
+VACUUM FREEZE VERBOSE vmtest;
+
+DROP TABLE vmtest;
On Thu, Aug 20, 2015 at 10:46 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Jim Nasby wrote:
I think things like pageinspect are very different; I really can't see any
use for those beyond debugging (and debugging by an expert at that).
I don't think that necessarily means it must continue to be in contrib.
Quite the contrary, I think it is a tool critical enough that it should
not be relegated to be a second-class citizen as it is now (let's face
it, being in contrib *is* second-class citizenship).
I have resisted that principle for years and will continue to do so.
It is entirely reasonable for some DBAs to want certain functionality
(debugging tools, crypto) to not be installed on their machines.
Folding everything into core is not a good policy, IMHO.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Robert Haas wrote:
On Thu, Aug 20, 2015 at 10:46 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
I don't think that necessarily means it must continue to be in contrib.
Quite the contrary, I think it is a tool critical enough that it should
not be relegated to be a second-class citizen as it is now (let's face
it, being in contrib *is* second-class citizenship).
I have resisted that principle for years and will continue to do so.
It is entirely reasonable for some DBAs to want certain functionality
(debugging tools, crypto) to not be installed on their machines.
Folding everything into core is not a good policy, IMHO.
I don't understand. I'm just proposing that the source code for the
extension to live in src/extensions/, and have the shared library
installed by toplevel make install; I'm not suggesting that the
extension is installed automatically. For that, you still need a
superuser to run CREATE EXTENSION.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Sep 3, 2015 at 2:26 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Robert Haas wrote:
On Thu, Aug 20, 2015 at 10:46 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
I don't think that necessarily means it must continue to be in contrib.
Quite the contrary, I think it is a tool critical enough that it should
not be relegated to be a second-class citizen as it is now (let's face
it, being in contrib *is* second-class citizenship).
I have resisted that principle for years and will continue to do so.
It is entirely reasonable for some DBAs to want certain functionality
(debugging tools, crypto) to not be installed on their machines.
Folding everything into core is not a good policy, IMHO.
I don't understand. I'm just proposing that the source code for the
extension to live in src/extensions/, and have the shared library
installed by toplevel make install; I'm not suggesting that the
extension is installed automatically. For that, you still need a
superuser to run CREATE EXTENSION.
Oh. Well, that's different. I don't particularly support that
proposal, but I'm not prepared to fight over it either.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-09-03 20:26, Alvaro Herrera wrote:
Robert Haas wrote:
On Thu, Aug 20, 2015 at 10:46 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
I don't think that necessarily means it must continue to be in contrib.
Quite the contrary, I think it is a tool critical enough that it should
not be relegated to be a second-class citizen as it is now (let's face
it, being in contrib *is* second-class citizenship).
I have resisted that principle for years and will continue to do so.
It is entirely reasonable for some DBAs to want certain functionality
(debugging tools, crypto) to not be installed on their machines.
Folding everything into core is not a good policy, IMHO.
I don't understand. I'm just proposing that the source code for the
extension to live in src/extensions/, and have the shared library
installed by toplevel make install; I'm not suggesting that the
extension is installed automatically. For that, you still need a
superuser to run CREATE EXTENSION.
+! for this
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Sep 3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote:
I don't understand. I'm just proposing that the source code for the
extension to live in src/extensions/, and have the shared library
installed by toplevel make install; I'm not suggesting that the
extension is installed automatically. For that, you still need a
superuser to run CREATE EXTENSION.
+! for this
OK, what does "+!" mean? (I know it is probably a shift-key mistype,
but it looks interesting.)
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 09/03/2015 05:11 PM, Bruce Momjian wrote:
On Thu, Sep 3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote:
I don't understand. I'm just proposing that the source code for the
extension to live in src/extensions/, and have the shared library
installed by toplevel make install; I'm not suggesting that the
extension is installed automatically. For that, you still need a
superuser to run CREATE EXTENSION.
+! for this
OK, what does "+!" mean? (I know it is probably a shift-key mistype,
but it looks interesting.)
Add the next factorial value?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 04/09/15 12:11, Bruce Momjian wrote:
On Thu, Sep 3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote:
I don't understand. I'm just proposing that the source code for the
extension to live in src/extensions/, and have the shared library
installed by toplevel make install; I'm not suggesting that the
extension is installed automatically. For that, you still need a
superuser to run CREATE EXTENSION.
+! for this
OK, what does "+!" mean? (I know it is probably a shift-key mistype,
but it looks interesting.)
It obviously signifies a Good Move that involved a check - at least,
that is what it would mean when annotating a Chess Game! :-)
On Fri, Sep 4, 2015 at 2:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Jim Nasby wrote:
I think things like pageinspect are very different; I really can't see any
use for those beyond debugging (and debugging by an expert at that).
I don't think that necessarily means it must continue to be in contrib.
Quite the contrary, I think it is a tool critical enough that it should
not be relegated to be a second-class citizen as it is now (let's face
it, being in contrib *is* second-class citizenship).
Attached patch is latest patch.
The previous patch lacks some files for regression test.
Attached fixed v12 patch.
The patch could be applied cleanly. "make check" could pass successfully.
But "make check-world -j 2" failed.
Regards,
--
Fujii Masao
Bruce Momjian wrote:
On Thu, Sep 3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote:
I don't understand. I'm just proposing that the source code for the
extension to live in src/extensions/, and have the shared library
installed by toplevel make install; I'm not suggesting that the
extension is installed automatically. For that, you still need a
superuser to run CREATE EXTENSION.
+! for this
OK, what does "+!" mean? (I know it is probably a shift-key mistype,
but it looks interesting.)
I took it as an uppercase 1 myself -- a shouted "PLUS ONE".
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Sep 4, 2015 at 10:35 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Sep 4, 2015 at 2:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Jim Nasby wrote:
I think things like pageinspect are very different; I really can't see any
use for those beyond debugging (and debugging by an expert at that).
I don't think that necessarily means it must continue to be in contrib.
Quite the contrary, I think it is a tool critical enough that it should
not be relegated to be a second-class citizen as it is now (let's face
it, being in contrib *is* second-class citizenship).
Attached patch is latest patch.
The previous patch lacks some files for regression test.
Attached fixed v12 patch.
The patch could be applied cleanly. "make check" could pass successfully.
But "make check-world -j 2" failed.
Thank you for looking at this patch.
Could you tell me which test failed?
make check-world -j 2 or more is done successfully in my environment.
Regards,
--
Masahiko Sawada
On 2015-09-04 02:11, Bruce Momjian wrote:
On Thu, Sep 3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote:
I don't understand. I'm just proposing that the source code for the
extension to live in src/extensions/, and have the shared library
installed by toplevel make install; I'm not suggesting that the
extension is installed automatically. For that, you still need a
superuser to run CREATE EXTENSION.
+! for this
OK, what does "+!" mean? (I know it is probably a shift-key mistype,
but it looks interesting.)
Yes, shift-key mistype:)
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Sep 3, 2015 at 11:56:52PM -0300, Alvaro Herrera wrote:
Bruce Momjian wrote:
On Thu, Sep 3, 2015 at 11:37:09PM +0200, Petr Jelinek wrote:
I don't understand. I'm just proposing that the source code for the
extension to live in src/extensions/, and have the shared library
installed by toplevel make install; I'm not suggesting that the
extension is installed automatically. For that, you still need a
superuser to run CREATE EXTENSION.
+! for this
OK, what does "+!" mean? (I know it is probably a shift-key mistype,
but it looks interesting.)
I took it as an uppercase 1 myself -- a shouted "PLUS ONE".
Oh, an ALL-CAPS +1. Yeah, it actually makes sense. ;-)
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 3 September 2015 at 18:23, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The previous patch lacks some files for regression test.
Attached fixed v12 patch.
This looks OK. You saw that I was proposing to solve this problem a
different way ("Summary of plans to avoid the annoyance of Freezing"),
suggesting that we wait for a few CFs to see if a patch emerges for that -
then fall back to this patch if it doesn't? So I am moving this patch to
next CF.
I apologise for the personal annoyance caused by this; I hope whatever
solution we find we can work together on it.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Sep 5, 2015 at 7:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 3 September 2015 at 18:23, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The previous patch lacks some files for regression test.
Attached fixed v12 patch.
This looks OK. You saw that I was proposing to solve this problem a
different way ("Summary of plans to avoid the annoyance of Freezing"),
suggesting that we wait for a few CFs to see if a patch emerges for that -
then fall back to this patch if it doesn't? So I am moving this patch to
next CF.
I apologise for the personal annoyance caused by this; I hope whatever
solution we find we can work together on it.
I had actually missed that thread, but I now understand the status of
the freeze avoidance topic.
It's no problem for me if we address Heikki's solution first and the
other plan (maybe the frozen map) after that.
But this frozen map patch is still under review and might have a
serious problem, so it still needs to be reviewed.
So I think we should at least continue reviewing this patch while
reviewing Heikki's solution, and then we can select the solution for
the frozen map.
Otherwise, if the frozen map turns out to have a serious problem or
some other big problem occurs, the patch will not have been reviewed
enough, and I think that will lead to a bad result.
Regards,
--
Masahiko Sawada
On Mon, Sep 7, 2015 at 11:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Sep 5, 2015 at 7:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 3 September 2015 at 18:23, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The previous patch lacks some files for regression test.
Attached fixed v12 patch.
This looks OK. You saw that I was proposing to solve this problem a
different way ("Summary of plans to avoid the annoyance of Freezing"),
suggesting that we wait for a few CFs to see if a patch emerges for that -
then fall back to this patch if it doesn't? So I am moving this patch to
next CF.
I apologise for the personal annoyance caused by this; I hope whatever
solution we find we can work together on it.
I had actually missed that thread, but I now understand the status of
the freeze avoidance topic.
It's no problem for me if we address Heikki's solution first and the
other plan (maybe the frozen map) after that.
But this frozen map patch is still under review and might have a
serious problem, so it still needs to be reviewed.
So I think we should at least continue reviewing this patch while
reviewing Heikki's solution, and then we can select the solution for
the frozen map.
Otherwise, if the frozen map turns out to have a serious problem or
some other big problem occurs, the patch will not have been reviewed
enough, and I think that will lead to a bad result.
I agree!
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-09-04 23:35:42 +0100, Simon Riggs wrote:
This looks OK. You saw that I was proposing to solve this problem a
different way ("Summary of plans to avoid the annoyance of Freezing"),
suggesting that we wait for a few CFs to see if a patch emerges for that -
then fall back to this patch if it doesn't? So I am moving this patch to
next CF.
As noted on that other thread I don't think that's a good policy, and it
seems like Robert agrees with me. So I think we should move this back to
"Needs Review".
Greetings,
Andres Freund
On Fri, Sep 4, 2015 at 2:55 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Sep 4, 2015 at 10:35 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Sep 4, 2015 at 2:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Jim Nasby wrote:
I think things like pageinspect are very different; I really can't see any
use for those beyond debugging (and debugging by an expert at that).
I don't think that necessarily means it must continue to be in contrib.
Quite the contrary, I think it is a tool critical enough that it should
not be relegated to be a second-class citizen as it is now (let's face
it, being in contrib *is* second-class citizenship).
Attached patch is latest patch.
The previous patch lacks some files for regression test.
Attached fixed v12 patch.
The patch could be applied cleanly. "make check" could pass successfully.
But "make check-world -j 2" failed.
Thank you for looking at this patch.
Could you tell me which test failed?
make check-world -j 2 or more is done successfully in my environment.
I tried to do the test again, but initdb failed with the following error.
creating template1 database in data/base/1 ... FATAL: invalid
input syntax for type oid: "f"
This error didn't happen when I tested before. So the commit which was
applied recently might interfere with the patch.
Regards,
--
Fujii Masao
On Fri, Sep 18, 2015 at 6:13 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Sep 4, 2015 at 2:55 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Sep 4, 2015 at 10:35 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Sep 4, 2015 at 2:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Jim Nasby wrote:
I think things like pageinspect are very different; I really can't see any
use for those beyond debugging (and debugging by an expert at that).
I don't think that necessarily means it must continue to be in contrib.
Quite the contrary, I think it is a tool critical enough that it should
not be relegated to be a second-class citizen as it is now (let's face
it, being in contrib *is* second-class citizenship).
Attached patch is latest patch.
The previous patch lacks some files for regression test.
Attached fixed v12 patch.
The patch could be applied cleanly. "make check" could pass successfully.
But "make check-world -j 2" failed.
Thank you for looking at this patch.
Could you tell me which test failed?
make check-world -j 2 or more is done successfully in my environment.
I tried to do the test again, but initdb failed with the following error.
creating template1 database in data/base/1 ... FATAL: invalid
input syntax for type oid: "f"
This error didn't happen when I tested before. So the commit which was
applied recently might interfere with the patch.
Thank you for testing!
Attached fixed version patch.
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v13.patchapplication/octet-stream; name=000_add_frozen_bit_into_visibilitymap_v13.patchDownload
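Before the diff itself, a minimal standalone C sketch of the two-bits-per-heap-block
encoding the patch introduces may help while reading it. This is not part of the patch:
the ALL_VISIBLE/ALL_FROZEN names and the 0x01/0x02 values are assumptions standing in
for VISIBILITYMAP_ALL_VISIBLE and VISIBILITYMAP_ALL_FROZEN (defined in
access/visibilitymap.h, not shown in this excerpt), and it ignores how the real map is
split across 8 kB pages (HEAPBLK_TO_MAPBLOCK); only the byte and bit-pair arithmetic
mirrors what the patch does.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed flag values; the real definitions live in access/visibilitymap.h. */
#define ALL_VISIBLE         0x01
#define ALL_FROZEN          0x02

#define BITS_PER_HEAPBLOCK  2                        /* all-visible bit + all-frozen bit */
#define HEAPBLOCKS_PER_BYTE (8 / BITS_PER_HEAPBLOCK) /* 4 heap blocks per map byte */

/* Which map byte holds the bit pair for heap block blkno (simplified single-page map). */
static size_t map_byte(uint32_t blkno)    { return blkno / HEAPBLOCKS_PER_BYTE; }
/* How far to shift the flags to reach that block's bit pair within the byte. */
static unsigned map_shift(uint32_t blkno) { return BITS_PER_HEAPBLOCK * (blkno % HEAPBLOCKS_PER_BYTE); }

static void vm_set(uint8_t *map, uint32_t blkno, uint8_t flags)
{
    map[map_byte(blkno)] |= (uint8_t) (flags << map_shift(blkno));
}

static int vm_test(const uint8_t *map, uint32_t blkno, uint8_t flags)
{
    return (map[map_byte(blkno)] & (flags << map_shift(blkno))) != 0;
}

int main(void)
{
    uint8_t map[16];

    memset(map, 0, sizeof(map));
    vm_set(map, 5, ALL_VISIBLE);              /* block 5: all-visible only */
    vm_set(map, 6, ALL_VISIBLE | ALL_FROZEN); /* block 6: all-visible and all-frozen */

    printf("block 5 all-frozen? %d\n", vm_test(map, 5, ALL_FROZEN)); /* prints 0 */
    printf("block 6 all-frozen? %d\n", vm_test(map, 6, ALL_FROZEN)); /* prints 1 */
    return 0;
}

The point of the extra bit, as the vacuumlazy.c hunks below show, is that a vacuum which
would otherwise have to scan the whole table (anti-wraparound) can skip any page whose
all-frozen bit is already set.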
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bcf9871..fc33772 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2176,8 +2176,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -2192,7 +2193,11 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllFrozen(BufferGetPage(buffer));
+
visibilitymap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
vmbuffer);
@@ -2493,7 +2498,11 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+
visibilitymap_clear(relation,
BufferGetBlockNumber(buffer),
vmbuffer);
@@ -2776,9 +2785,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -2972,10 +2981,15 @@ l1:
*/
PageSetPrunable(page, xid);
+ /* clear PD_ALL_VISIBLE and PD_ALL_FROZEN flags */
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
vmbuffer);
}
@@ -3254,7 +3268,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -3850,14 +3864,22 @@ l2:
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllFrozen(BufferGetPage(buffer));
+
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
vmbuffer);
}
if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
{
all_visible_cleared_new = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(newbuf));
+ PageClearAllFrozen(BufferGetPage(newbuf));
+
visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
vmbuffer_new);
}
@@ -6942,7 +6964,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -6952,6 +6974,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7541,8 +7564,14 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
+
}
else if (action == BLK_RESTORED)
{
@@ -7593,7 +7622,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
@@ -7743,7 +7772,10 @@ heap_xlog_delete(XLogReaderState *record)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
/* Make sure there is no forward chain link in t_ctid */
htup->t_ctid = target_tid;
@@ -7847,7 +7879,10 @@ heap_xlog_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
MarkBufferDirty(buffer);
}
@@ -7986,7 +8021,10 @@ heap_xlog_multi_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
MarkBufferDirty(buffer);
}
@@ -8114,7 +8152,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(obuffer);
@@ -8249,7 +8290,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
elog(PANIC, "heap_update_redo: failed to add tuple");
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..a284b85 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,33 +21,45 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if whole table scanning vacuum is required (e.g. anti-wraparound vacuum).
+ * An all-frozen bit must be set only when the page is already all-visible.
+ * That is, the all-frozen bit is always set together with the all-visible bit.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. And all-frozen bit must be
+ * cleared with all-visible at the same time.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or frozen
+ * to all transactions; we just don't know that for certain. The difficulty is
+ * that there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit. If a
+ * crash occurs after the visibility map page makes it to disk and before the
+ * updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
+ * The visibility map is not used for anti-wraparound vacuums before 9.5, because
* an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
* present in the table, even on pages that don't have any dead tuples.
+ * In 9.6 or later, the visibility map has an additional bit which indicates that all
+ * tuples on a single page have been completely frozen, so the visibility map is also used for
+ * anti-wraparound vacuums.
+ *
*
* LOCKING
*
@@ -58,14 +70,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +113,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, other for all-frozen.
+*/
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +130,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set bits for visible and freeze */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +174,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,7 +186,8 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) <<
+ (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
@@ -225,7 +259,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +268,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and flags which indicates what flag we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +280,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,7 +290,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -272,11 +308,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +321,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +331,15 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(PageIsAllVisible(heapPage) || PageIsAllFrozen(heapPage));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +350,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test if bit(s) is set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must set flags
+ * which indicates what flag we want to test.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -328,7 +369,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +378,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +401,12 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single or double bit read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ?
+ true : false;
return result;
}
@@ -374,10 +416,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * The caller must set the flags which indicates what flag we want to count.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, uint8 flags)
{
BlockNumber result = 0;
BlockNumber mapBlock;
@@ -406,7 +449,10 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ if (flags & VISIBILITYMAP_ALL_VISIBLE)
+ result += number_of_ones_for_visible[map[i]];
+ if (flags & VISIBILITYMAP_ALL_FROZEN)
+ result += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..10f8dc9 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ {
+ relallvisible = visibilitymap_count(rel, VISIBILITYMAP_ALL_VISIBLE);
+ relallfrozen = visibilitymap_count(rel, VISIBILITYMAP_ALL_FROZEN);
+ }
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
if (rd_rel->relpages != (int32) relpages)
{
@@ -1940,6 +1947,11 @@ index_update_stats(Relation rel,
rd_rel->relallvisible = (int32) relallvisible;
dirty = true;
}
+ if (rd_rel->relallfrozen != (int32) relallfrozen)
+ {
+ rd_rel->relallfrozen = (int32) relallfrozen;
+ dirty = true;
+ }
}
/*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 861048f..392c2a4 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -572,7 +572,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN),
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -595,6 +596,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ab4874..d3725dd 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -22,6 +22,7 @@
#include "access/rewriteheap.h"
#include "access/transam.h"
#include "access/tuptoaster.h"
+#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 85b0483..744bfff 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -747,6 +747,7 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
bool in_outer_xact)
@@ -784,6 +785,11 @@ vac_update_relstats(Relation relation,
pgcform->relallvisible = (int32) num_all_visible_pages;
dirty = true;
}
+ if (pgcform->relallfrozen != (int32) num_all_frozen_pages)
+ {
+ pgcform->relallfrozen = (int32) num_all_frozen_pages;
+ dirty = true;
+ }
/* Apply DDL updates, but not inside an outer transaction (see above) */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..120de63 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped by all-frozen bit
+ of visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we may still be able to skip scanning some pages
+ * according to the frozen map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +258,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -301,10 +307,14 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ new_rel_allfrozen = visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN);
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -312,6 +322,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_pages,
new_rel_tuples,
new_rel_allvisible,
+ new_rel_allfrozen,
vacrelstats->hasindex,
new_frozen_xid,
new_min_multi,
@@ -360,10 +371,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +498,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * visibility map means that we can't update relfrozenxid, so we only want
+ * to do it if we can skip a goodly number. On the other hand, we count
+ * both how many pages we skipped according to the all-frozen bit of the visibility
+ * map and how many pages we froze, so we can update relfrozenxid if
+ * the sum of them equals the number of pages in the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -515,7 +530,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -533,7 +549,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples frozen by this vacuum */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on the page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
bool all_visible;
@@ -548,7 +567,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN))
break;
vacuum_delay_point();
}
@@ -566,9 +586,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to visibility map.
+ * We check whether this block is all-frozen so that we can skip vacuuming
+ * this page even when a whole-table scan is required.
+ */
+ if (scan_all)
+ {
+ if (visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ }
+ else
+ {
+ if (skipping_all_visible_blocks)
+ continue;
+ }
+
all_visible_according_to_vm = true;
}
@@ -740,7 +776,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +801,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +957,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +975,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
+ * We must log the changes to be crash-safe against future truncation
+ * of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1011,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1036,47 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen in total */
+ if ((ntotal_frozen == ntup_per_page) &&
+ !visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,7 +1087,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
@@ -1047,6 +1116,17 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
visibilitymap_clear(onerel, blkno, vmbuffer);
}
+ /*
+ * As a result of scanning a page, we set VM all-frozen bit and page header
+ * if all tuples of single page are frozen.
+ */
+ if (ntotal_frozen == ntup_per_page)
+ {
+ PageSetAllFrozen(page);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer,
+ InvalidTransactionId, VISIBILITYMAP_ALL_FROZEN);
+ }
+
UnlockReleaseBuffer(buf);
/* Remember the location of the last page with nonremovable tuples */
@@ -1078,7 +1158,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map and frozen map page.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1115,6 +1195,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
tups_vacuumed, vacuumed_pages)));
/*
+ * This information shows how much effect the all-frozen bit of the VM
+ * had on avoiding the freezing of tuples.
+ */
+ ereport(elevel,
+ (errmsg("Skipped %d frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages)));
+
+ /*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
*/
@@ -1226,6 +1314,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1277,19 +1366,31 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set VM all-frozen bit and flag.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
- Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ if (!visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* mark page all-frozen, and set VM all-frozen bit */
+ if (all_frozen)
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1408,6 +1509,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_pages,
stats->num_index_tuples,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
@@ -1782,7 +1884,8 @@ vac_cmp_itemptr(const void *left, const void *right)
* xmin amongst the visible tuples.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1791,6 +1894,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1814,7 +1918,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
@@ -1855,6 +1959,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1863,6 +1971,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 1ef76d0..ee49ddf 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -127,7 +127,7 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList)
if (attno != resultDesc->natts)
ereport(ERROR,
(errcode(ERRCODE_DATATYPE_MISMATCH),
- errmsg("table row type and query-specified row type do not match"),
+ errmsg("table row type and query-specified row type do not match"),
errdetail("Query has too few columns.")));
}
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 79d9390..ac3e0f9 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
#include <fcntl.h>
@@ -21,6 +22,27 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static int rewrite_vm_to_vfm(const char *fromfile, const char *tofile, bool force);
+
+/* table for fast rewriting vm file to vfm file */
+static const uint16 rewrite_vm_to_vfm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
/*
* copyAndUpdateFile()
@@ -30,11 +52,19 @@ static int win32_pghardlink(const char *src, const char *dst);
*/
const char *
copyAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst, bool force)
+ const char *src, const char *dst, bool force, bool rewrite_vm)
{
+
if (pageConverter == NULL)
{
- if (pg_copy_file(src, dst, force) == -1)
+ int ret;
+
+ if (rewrite_vm)
+ ret = rewrite_vm_to_vfm(src, dst, force);
+ else
+ ret = pg_copy_file(src, dst, force);
+
+ if (ret)
return getErrorText(errno);
else
return NULL;
@@ -99,7 +129,6 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
}
}
-
/*
* linkAndUpdateFile()
*
@@ -201,6 +230,110 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewrite_vm_to_vfm()
+ *
+ * An additional bit, indicating that all tuples on a page are completely
+ * frozen, was added to the visibility map in PG 9.6, so the format of the
+ * visibility map has changed.
+ * Copies a visibility map file while inserting an all-frozen bit (initially 0)
+ * after each existing all-visible bit.
+ */
+static int
+rewrite_vm_to_vfm(const char *fromfile, const char *tofile, bool force)
+{
+#define REWRITE_BUF_SIZE (50 * BLCKSZ)
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd, dst_fd;
+ uint16 vfm_bits;
+ ssize_t nbytes;
+ char *buffer;
+ int ret = 0;
+ int save_errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ {
+ errno = EINVAL;
+ return -1;
+ }
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ return -1;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ {
+ save_errno = errno;
+ if (src_fd != 0)
+ close(src_fd);
+
+ errno = save_errno;
+ return -1;
+ }
+
+ buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
+
+ /* Copy page header data in advance */
+ if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
+ {
+ save_errno = errno;
+ return -1;
+ }
+
+ if (write(dst_fd, buffer, nbytes) != nbytes)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ save_errno = errno;
+ return -1;
+ }
+
+ /* perform data rewriting, i.e. read from source, rewrite, and write to destination */
+ while (true)
+ {
+ ssize_t nbytes = read(src_fd, buffer, REWRITE_BUF_SIZE);
+ char *cur, *end;
+
+ if (nbytes < 0)
+ {
+ ret = -1;
+ break;
+ }
+
+ if (nbytes == 0)
+ break;
+
+ cur = buffer;
+ end = buffer + nbytes;
+
+ /*
+ * Rewrite each source byte into BITS_PER_HEAPBLOCK output bytes and write them to dst_fd.
+ */
+ while (end > cur)
+ {
+ /* Look up the rewritten two-byte value for this source byte */
+ vfm_bits = rewrite_vm_to_vfm_table[(uint8) *cur];
+
+ if (write(dst_fd, &vfm_bits, BITS_PER_HEAPBLOCK) != BITS_PER_HEAPBLOCK)
+ {
+ ret = -1;
+ break;
+ }
+ cur++;
+ }
+ }
+
+ pg_free(buffer);
+
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return ret;
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 13aa891..d407666 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -112,6 +112,11 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of visibility map changed with this 9.6 commit,
+ *
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201509181
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -397,7 +402,7 @@ typedef void *pageCnvCtx;
#endif
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst, bool force);
+ const char *dst, bool force, bool rewrite_vm);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c22df42..766a473 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *type_old_suffix, const char *type_new_suffix);
/*
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_rewrite_needed = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,13 +181,20 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite the "vm" file to a "vfm" file?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_rewrite_needed = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(pageConverter, &maps[mapnum], "", "");
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +202,17 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(pageConverter, &maps[mapnum], "_fsm", "_fsm");
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ {
+ /*
+ * The vm file is renamed to a vfm file in PG 9.6.
+ */
+ if (vm_rewrite_needed)
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", "_vfm");
+ else
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", "_vm");
+ }
}
}
}
@@ -210,7 +226,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
*/
static void
transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+ const char *type_old_suffix, const char *type_new_suffix)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -218,6 +234,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
int fd;
int segno;
char extent_suffix[65];
+ bool rewrite_vm = false;
/*
* Now copy/link any related segments as well. Remember, PG breaks large
@@ -236,18 +253,18 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
map->old_tablespace_suffix,
map->old_db_oid,
map->old_relfilenode,
- type_suffix,
+ type_old_suffix,
extent_suffix);
snprintf(new_file, sizeof(new_file), "%s%s/%u/%u%s%s",
map->new_tablespace,
map->new_tablespace_suffix,
map->new_db_oid,
map->new_relfilenode,
- type_suffix,
+ type_new_suffix,
extent_suffix);
/* Is it an extent, fsm, or vm file? */
- if (type_suffix[0] != '\0' || segno != 0)
+ if (type_old_suffix[0] != '\0' || segno != 0)
{
/* Did file open fail? */
if ((fd = open(old_file, O_RDONLY, 0)) == -1)
@@ -276,7 +293,11 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /* We need to rewrite vm file to vfm file. */
+ if (strcmp(type_old_suffix, type_new_suffix) != 0)
+ rewrite_vm = true;
+
+ if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true, rewrite_vm)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index aa7f399..6b058d4 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 66dfef1..5898f1b 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -30,11 +30,14 @@
* If you add a new entry, remember to update the errhint in
* forkname_to_number() below, and update the SGML documentation for
* pg_relation_size().
+ * In 9.6 or later, the visibility map fork name is changed from "vm" to
+ * "vfm" because the visibility map has not only information about all-visible
+ * pages but also information about all-frozen pages.
*/
const char *const forkNames[] = {
"main", /* MAIN_FORKNUM */
"fsm", /* FSM_FORKNUM */
- "vm", /* VISIBILITYMAP_FORKNUM */
+ "vfm", /* VISIBILITYMAP_FORKNUM */
"init" /* INIT_FORKNUM */
};
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..93afb10 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..7270609 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,20 @@
#include "storage/buf.h"
#include "utils/relcache.h"
-extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
+extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern BlockNumber visibilitymap_count(Relation rel, uint8 flags);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 9730561..45b117c 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201509161
+#define CATALOG_VERSION_NO 201509181
#endif
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 25247b5..e64a1c8 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -95,7 +97,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 30
+#define Natts_pg_class 31
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -107,25 +109,26 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relispopulated 25
-#define Anum_pg_class_relreplident 26
-#define Anum_pg_class_relfrozenxid 27
-#define Anum_pg_class_relminmxid 28
-#define Anum_pg_class_relacl 29
-#define Anum_pg_class_reloptions 30
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relispopulated 26
+#define Anum_pg_class_relreplident 27
+#define Anum_pg_class_relfrozenxid 28
+#define Anum_pg_class_relminmxid 29
+#define Anum_pg_class_relacl 30
+#define Anum_pg_class_reloptions 31
/* ----------------
* initial contents of pg_class
@@ -140,13 +143,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 29 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 29 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e3a31af..d2bae2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -172,6 +172,7 @@ extern void vac_update_relstats(Relation relation,
BlockNumber num_pages,
double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..7bf2718 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -369,6 +371,13 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+#define PageClearAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags &= ~PD_ALL_FROZEN)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..a410553
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,29 @@
+--
+-- Visibility map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+\set VERBOSITY terse
+-- All pages become all-visible
+VACUUM vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- All pages become all-frozen
+VACUUM FREEZE vmtest;
+SELECT relallfrozen = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- All pages are skipped according to VM
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: Skipped 45 frozen pages acoording to visibility map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 6fc5d1e..a5ff786 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -108,5 +108,8 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare without_oid c
# event triggers cannot run concurrently with any test that runs DDL
test: event_trigger
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
+
# run stats by itself because its delay may be insufficient under heavy load
test: stats
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 2ae51cf..d386d67 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -158,3 +158,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
\ No newline at end of file
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..9bf9094
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,20 @@
+--
+-- Visibility map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+\set VERBOSITY terse
+
+-- All pages become all-visible
+VACUUM vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- All pages become all-frozen
+VACUUM FREEZE vmtest;
+SELECT relallfrozen = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- All pages are skipped according to VM
+VACUUM FREEZE VERBOSE vmtest;
+
+DROP TABLE vmtest;
On Fri, Sep 18, 2015 at 8:14 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Sep 18, 2015 at 6:13 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Sep 4, 2015 at 2:55 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Sep 4, 2015 at 10:35 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Sep 4, 2015 at 2:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Jim Nasby wrote:
I think things like pageinspect are very different; I really can't see any
use for those beyond debugging (and debugging by an expert at that).
I don't think that necessarily means it must continue to be in contrib.
Quite the contrary, I think it is a tool critical enough that it should
not be relegated to be a second-class citizen as it is now (let's face
it, being in contrib *is* second-class citizenship).
Attached patch is the latest patch.
The previous patch lacks some files for regression test.
Attached fixed v12 patch.
The patch could be applied cleanly. "make check" could pass successfully.
But "make check-world -j 2" failed.Thank you for looking at this patch.
Could you tell me which test failed?
"make check-world" with -j 2 or more completes successfully in my environment.
I tried to do the test again, but initdb failed with the following error.
creating template1 database in data/base/1 ... FATAL: invalid
input syntax for type oid: "f"
This error didn't happen when I tested before. So the commit which was
applied recently might interfere with the patch.
Thank you for testing!
Attached fixed version patch.
Thanks for updating the patch! Here are comments.
+#include "access/visibilitymap.h"
visibilitymap.h doesn't need to be included in cluster.c.
- errmsg("table row type and query-specified row type do not match"),
+ errmsg("table row type and query-specified row type
do not match"),
This change doesn't seem to be necessary.
+#define Anum_pg_class_relallfrozen 12
Why is pg_class.relallfrozen necessary? ISTM that there is no user of it now.
lazy_scan_heap() calls PageClearAllVisible() when the page containing
dead tuples is marked as all-visible. Shouldn't PageClearAllFrozen() be
called at the same time?
- "vm", /* VISIBILITYMAP_FORKNUM */
+ "vfm", /* VISIBILITYMAP_FORKNUM */
I wonder how much it's worth renaming only the file extension while
there are many places where "visibility map" and "vm" are used,
for example, log messages, function names, variables, etc.
Regards,
--
Fujii Masao
On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
I wonder how much it's worth renaming only the file extension while
there are many places where "visibility map" and "vm" are used,
for example, log messages, function names, variables, etc.
I'd be inclined to keep calling it the visibility map (vm) even if it
also contains freeze information.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 10/01/2015 07:43 AM, Robert Haas wrote:
On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
I wonder how much it's worth renaming only the file extension while
there are many places where "visibility map" and "vm" are used,
for example, log messages, function names, variables, etc.I'd be inclined to keep calling it the visibility map (vm) even if it
also contains freeze information.
-1 to rename. Visibility Map is a perfectly good name.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Fri, Oct 2, 2015 at 7:30 AM, Josh Berkus <josh@agliodbs.com> wrote:
On 10/01/2015 07:43 AM, Robert Haas wrote:
On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
I wonder how much it's worth renaming only the file extension while
there are many places where "visibility map" and "vm" are used,
for example, log messages, function names, variables, etc.
I'd be inclined to keep calling it the visibility map (vm) even if it also contains freeze information.
-1 to rename. Visibility Map is a perfectly good name.
Thank you for taking time to review this patch.
Attached is the latest v14 patch.
The v14 patch no longer renames the visibility map file to "vfm",
and it contains some bug fixes.
+#include "access/visibilitymap.h"
visibilitymap.h doesn't need to be included in cluster.c.
Fixed.
- errmsg("table row type and query-specified row type do not match"), + errmsg("table row type and query-specified row type do not match"), This change doesn't seem to be necessary.
Fixed.
+#define Anum_pg_class_relallfrozen 12
Why is pg_class.relallfrozen necessary? ISTM that there is no user of it now.
The relallfrozen value would be useful for users to estimate how much work a
VACUUM FREEZE or anti-wraparound vacuum will have to do before actually
running it.
(This value is also used in the regression test.)
But this information is not used for planning, unlike relallvisible, so it
might be better to move it to another system view such as pg_stat_*_tables.
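For example, a rough query along the lines of the attached regression test could
use relallfrozen to see how much of each table still needs freezing (just a
sketch: relallfrozen exists only with this patch applied, and both counters are
only as fresh as the last VACUUM or ANALYZE):

-- Approximate fraction of heap pages already marked all-frozen.
SELECT relname,
       relallfrozen,
       relpages,
       round(100.0 * relallfrozen / greatest(relpages, 1), 1) AS frozen_pct
  FROM pg_class
 WHERE relkind = 'r'
 ORDER BY relpages - relallfrozen DESC;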
lazy_scan_heap() calls PageClearAllVisible() when the page containing
dead tuples is marked as all-visible. Shouldn't PageClearAllFrozen() be
called at the same time?
Fixed.
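For reference, a minimal sketch of the pattern this fix applies, mirroring the
heap_delete() and heap_update() hunks earlier in the patch (page, buffer,
relation, and vmbuffer are assumed to be set up as in those hunks):

	if (PageIsAllVisible(page))
	{
		/* all-frozen implies all-visible, so clear both page-level bits together */
		PageClearAllVisible(page);
		PageClearAllFrozen(page);

		/* ... and clear both bits for this block in the visibility map */
		visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer);
	}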
- "vm", /* VISIBILITYMAP_FORKNUM */ + "vfm", /* VISIBILITYMAP_FORKNUM */ I wonder how much it's worth renaming only the file extension while there are many places where "visibility map" and "vm" are used, for example, log messages, function names, variables, etc.I'd be inclined to keep calling it the visibility map (vm) even if it
also contains freeze information.-1 to rename. Visibility Map is a perfectly good name.
Yeah, I agree with this.
The latest v14 patch is changed accordingly.
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v14.patchtext/x-patch; charset=US-ASCII; name=000_add_frozen_bit_into_visibilitymap_v14.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bcf9871..fc33772 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2176,8 +2176,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map
+ * page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -2192,7 +2193,11 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllFrozen(BufferGetPage(buffer));
+
visibilitymap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
vmbuffer);
@@ -2493,7 +2498,11 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+
visibilitymap_clear(relation,
BufferGetBlockNumber(buffer),
vmbuffer);
@@ -2776,9 +2785,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -2972,10 +2981,15 @@ l1:
*/
PageSetPrunable(page, xid);
+ /* clear PD_ALL_VISIBLE and PD_ALL_FROZEN flags */
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
vmbuffer);
}
@@ -3254,7 +3268,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -3850,14 +3864,22 @@ l2:
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(buffer));
+ PageClearAllFrozen(BufferGetPage(buffer));
+
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
vmbuffer);
}
if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
{
all_visible_cleared_new = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(BufferGetPage(newbuf));
+ PageClearAllFrozen(BufferGetPage(newbuf));
+
visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
vmbuffer_new);
}
@@ -6942,7 +6964,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -6952,6 +6974,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7541,8 +7564,14 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
+
}
else if (action == BLK_RESTORED)
{
@@ -7593,7 +7622,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
@@ -7743,7 +7772,10 @@ heap_xlog_delete(XLogReaderState *record)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
/* Make sure there is no forward chain link in t_ctid */
htup->t_ctid = target_tid;
@@ -7847,7 +7879,10 @@ heap_xlog_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
MarkBufferDirty(buffer);
}
@@ -7986,7 +8021,10 @@ heap_xlog_multi_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
MarkBufferDirty(buffer);
}
@@ -8114,7 +8152,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(obuffer);
@@ -8249,7 +8290,10 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
elog(PANIC, "heap_update_redo: failed to add tuple");
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
+ {
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
+ }
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..a284b85 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,33 +21,45 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if a whole-table-scanning vacuum is required (e.g. an anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible;
+ * that is, the all-frozen bit is always set together with the all-visible bit.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. The all-frozen bit must be
+ * cleared together with the all-visible bit.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or frozen
+ * to all transactions; we just don't know that for certain. The difficulty is
+ * that there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit. If a
+ * crash occurs after the visibility map page makes it to disk and before the
+ * updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
+ * Before 9.6, the visibility map was not used for anti-wraparound vacuums, because
* an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
* present in the table, even on pages that don't have any dead tuples.
+ * In 9.6 or later, the visibility map has an additional bit which indicates that
+ * all tuples on a single page have been completely frozen, so the visibility map
+ * is also used for anti-wraparound vacuums.
+ *
*
* LOCKING
*
@@ -58,14 +70,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +113,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +130,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set bits for all-visible and all-frozen */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +174,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,7 +186,8 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) <<
+ (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
@@ -225,7 +259,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +268,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and flags indicating which bit(s) to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +280,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,7 +290,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -272,11 +308,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +321,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +331,15 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(PageIsAllVisible(heapPage) || PageIsAllFrozen(heapPage));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +350,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test if bit(s) is set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible to all, or all frozen, according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits. The caller must
+ * set flags to indicate which bit(s) to test.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -328,7 +369,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +378,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +401,12 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single or double bit read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ?
+ true : false;
return result;
}
@@ -374,10 +416,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * The caller must set flags to indicate which bit to count.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, uint8 flags)
{
BlockNumber result = 0;
BlockNumber mapBlock;
@@ -406,7 +449,10 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ if (flags & VISIBILITYMAP_ALL_VISIBLE)
+ result += number_of_ones_for_visible[map[i]];
+ if (flags & VISIBILITYMAP_ALL_FROZEN)
+ result += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..10f8dc9 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ {
+ relallvisible = visibilitymap_count(rel, VISIBILITYMAP_ALL_VISIBLE);
+ relallfrozen = visibilitymap_count(rel, VISIBILITYMAP_ALL_FROZEN);
+ }
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
if (rd_rel->relpages != (int32) relpages)
{
@@ -1940,6 +1947,11 @@ index_update_stats(Relation rel,
rd_rel->relallvisible = (int32) relallvisible;
dirty = true;
}
+ if (rd_rel->relallfrozen != (int32) relallfrozen)
+ {
+ rd_rel->relallfrozen = (int32) relallfrozen;
+ dirty = true;
+ }
}
/*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ddb68ab..ee13f41 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -572,7 +572,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN),
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -595,6 +596,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 6d55148..e5df123 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -747,6 +747,7 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
bool in_outer_xact)
@@ -784,6 +785,11 @@ vac_update_relstats(Relation relation,
pgcform->relallvisible = (int32) num_all_visible_pages;
dirty = true;
}
+ if (pgcform->relallfrozen != (int32) num_all_frozen_pages)
+ {
+ pgcform->relallfrozen = (int32) num_all_frozen_pages;
+ dirty = true;
+ }
/* Apply DDL updates, but not inside an outer transaction (see above) */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..35a025e 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped according to
+ the all-frozen bit of the visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we can still skip scanning some pages
+ * according to the all-frozen bit of the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +258,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -301,10 +307,14 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ new_rel_allfrozen = visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN);
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -312,6 +322,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_pages,
new_rel_tuples,
new_rel_allvisible,
+ new_rel_allfrozen,
vacrelstats->hasindex,
new_frozen_xid,
new_min_multi,
@@ -360,10 +371,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +498,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * the visibility map means that we can't update relfrozenxid, so we only
+ * want to do it if we can skip a goodly number. On the other hand, we
+ * count both how many pages we skipped according to the all-frozen bit of
+ * the visibility map and how many pages we froze, so we can update
+ * relfrozenxid if the sum of them equals the number of pages in the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -498,7 +513,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we might not actually skip any pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
@@ -515,7 +530,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
break;
vacuum_delay_point();
}
@@ -533,9 +548,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples frozen by this vacuum */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* # of frozen tuples in a single page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -548,7 +567,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE))
break;
vacuum_delay_point();
}
@@ -563,13 +583,32 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
skipping_all_visible_blocks = false;
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We check whether this block is all-frozen so that we can skip
+ * vacuuming it even when a whole-table scan is required.
+ */
+ bool all_frozen = visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN);
+ if (scan_all)
+ {
+ if (all_frozen)
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ }
+ else
+ {
+ if (skipping_all_visible_blocks)
+ continue;
+ }
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -740,7 +779,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +804,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +960,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +978,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
+ * We must log the changes to be crash-safe against future truncation
+ * of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1014,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1039,46 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,7 +1089,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
@@ -1043,6 +1114,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
relname, blkno);
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
MarkBufferDirty(buf);
visibilitymap_clear(onerel, blkno, vmbuffer);
}
@@ -1078,7 +1150,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map and frozen map page.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1115,6 +1187,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
tups_vacuumed, vacuumed_pages)));
/*
+ * This information would be effective for how much effect all-frozen bit
+ * of VM had for freezing tuples.
+ */
+ ereport(elevel,
+ (errmsg("Skipped %d frozen pages acoording to visibility map",
+ vacrelstats->vmskipped_frozen_pages)));
+
+ /*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
*/
@@ -1226,6 +1306,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1277,19 +1358,31 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set VM all-frozen bit and flag.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
- Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ if (!visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* mark page all-frozen, and set VM all-frozen bit */
+ if (all_frozen)
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1408,6 +1501,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_pages,
stats->num_index_tuples,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
@@ -1779,10 +1873,12 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and all_forzen which implies that all tuples
+ * of this page are frozen.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1791,6 +1887,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1814,11 +1911,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1855,6 +1953,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1863,6 +1965,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 41d4606..c34b5da 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -231,6 +231,15 @@ check_cluster_versions(void)
if (old_cluster.major_version > new_cluster.major_version)
pg_fatal("This utility cannot be used to downgrade to older major PostgreSQL versions.\n");
+ /*
+ * We can't allow upgrading in link mode from 9.5 or before to 9.6 or later,
+ * because the format of the visibility map changed in version 9.6.
+ */
+ if (user_opts.transfer_mode == TRANSFER_MODE_LINK &&
+ GET_MAJOR_VERSION(old_cluster.major_version) < 906 &&
+ GET_MAJOR_VERSION(new_cluster.major_version) >= 906)
+ pg_fatal("This utility cannot upgrade from PostgreSQL 9.5 or before to 9.6 or later in link mode.\n");
+
/* get old and new binary versions */
get_bin_version(&old_cluster);
get_bin_version(&new_cluster);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 79d9390..392d474 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
#include <fcntl.h>
@@ -21,6 +22,43 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static const char *rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool force);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
+
+/*
+ * copyOrRewriteFile()
+ * This function copies a file, or rewrites a visibility map file.
+ * If rewrite_vm is true, we rewrite the visibility map regardless of the value of pageConverter.
+ */
+const char *
+copyOrRewriteFile(pageCnvCtx *pageConverter,
+ const char *src, const char *dst, bool force, bool rewrite_vm)
+{
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, force);
+ else
+ return copyAndUpdateFile(pageConverter, src, dst, force);
+}
/*
* copyAndUpdateFile()
@@ -32,6 +70,7 @@ const char *
copyAndUpdateFile(pageCnvCtx *pageConverter,
const char *src, const char *dst, bool force)
{
+
if (pageConverter == NULL)
{
if (pg_copy_file(src, dst, force) == -1)
@@ -99,7 +138,6 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
}
}
-
/*
* linkAndUpdateFile()
*
@@ -201,6 +239,110 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilitymap()
+ *
+ * A additional bit which indicates that all tuples on page is completely
+ * frozen is added into visibility map at PG 9.6. So the format of visibiilty
+ * map has been changed.
+ * Copies a visibility map file while adding all-frozen bit(0) into each bit.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+#define REWRITE_BUF_SIZE (50 * BLCKSZ)
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd, dst_fd;
+ uint16 vm_bits;
+ ssize_t nbytes;
+ char *buffer;
+ int ret = 0;
+ int save_errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ {
+ errno = EINVAL;
+ return getErrorText(errno);
+ }
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ return getErrorText(errno);
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ {
+ save_errno = errno;
+ if (src_fd != 0)
+ close(src_fd);
+
+ errno = save_errno;
+ return getErrorText(errno);
+ }
+
+ buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
+
+ /* Copy page header data in advance */
+ if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
+ {
+ save_errno = errno;
+ return getErrorText(errno);
+ }
+
+ if (write(dst_fd, buffer, nbytes) != nbytes)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ save_errno = errno;
+ return getErrorText(errno);
+ }
+
+ /* perform data rewriting, i.e. read from source, write to destination */
+ while (true)
+ {
+ ssize_t nbytes = read(src_fd, buffer, REWRITE_BUF_SIZE);
+ char *cur, *end;
+
+ if (nbytes < 0)
+ {
+ ret = -1;
+ break;
+ }
+
+ if (nbytes == 0)
+ break;
+
+ cur = buffer;
+ end = buffer + nbytes;
+
+ /*
+ * Rewrite a byte and write dest_fd per BITS_PER_HEAPBLOCK bytes.
+ */
+ while (end > cur)
+ {
+ /* Get the rewritten bits from the table */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+
+ if (write(dst_fd, &vm_bits, BITS_PER_HEAPBLOCK) != BITS_PER_HEAPBLOCK)
+ {
+ ret = -1;
+ break;
+ }
+ cur++;
+ }
+ }
+
+ pg_free(buffer);
+
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return NULL;
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 13aa891..fc92a5f 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -112,6 +112,11 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of visibility map changed with this 9.6 commit,
+ *
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201509181
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -396,6 +401,8 @@ const pageCnvCtx *setupPageConverter(void);
typedef void *pageCnvCtx;
#endif
+const char *copyOrRewriteFile(pageCnvCtx *pageConverter, const char *src,
+ const char *dst, bool force, bool rewrite_vm);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c22df42..2fa5b47 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_rewrite_needed = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,6 +181,13 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_rewrite_needed = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
@@ -195,7 +203,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
* Copy/link any fsm and vm files, if they exist
*/
transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
- if (vm_crashsafe_match)
+ if (vm_crashsafe_match || vm_rewrite_needed)
transfer_relfile(pageConverter, &maps[mapnum], "_vm");
}
}
@@ -218,6 +226,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
int fd;
int segno;
char extent_suffix[65];
+ bool rewrite_vm = false;
/*
* Now copy/link any related segments as well. Remember, PG breaks large
@@ -276,7 +285,15 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (strcmp(type_suffix, "_vm") == 0 &&
+ old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ rewrite_vm = true;
+
+ if ((msg = copyOrRewriteFile(pageConverter, old_file, new_file, true, rewrite_vm)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index aa7f399..6b058d4 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 66dfef1..52ff14e 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -30,6 +30,9 @@
* If you add a new entry, remember to update the errhint in
* forkname_to_number() below, and update the SGML documentation for
* pg_relation_size().
+ * 9.6 or later, the visibility map fork name is changed from "vm" to
+ * "vfm" bacause visibility map has not only information about all-visible
+ * but also information about all-frozen.
*/
const char *const forkNames[] = {
"main", /* MAIN_FORKNUM */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..93afb10 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..7270609 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,20 @@
#include "storage/buf.h"
#include "utils/relcache.h"
-extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
+extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern BlockNumber visibilitymap_count(Relation rel, uint8 flags);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 9730561..45b117c 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201509161
+#define CATALOG_VERSION_NO 201509181
#endif
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 25247b5..e64a1c8 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -95,7 +97,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 30
+#define Natts_pg_class 31
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -107,25 +109,26 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relispopulated 25
-#define Anum_pg_class_relreplident 26
-#define Anum_pg_class_relfrozenxid 27
-#define Anum_pg_class_relminmxid 28
-#define Anum_pg_class_relacl 29
-#define Anum_pg_class_reloptions 30
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relispopulated 26
+#define Anum_pg_class_relreplident 27
+#define Anum_pg_class_relfrozenxid 28
+#define Anum_pg_class_relminmxid 29
+#define Anum_pg_class_relacl 30
+#define Anum_pg_class_reloptions 31
/* ----------------
* initial contents of pg_class
@@ -140,13 +143,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 29 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 29 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e3a31af..d2bae2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -172,6 +172,7 @@ extern void vac_update_relstats(Relation relation,
BlockNumber num_pages,
double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..7bf2718 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -369,6 +371,13 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+#define PageClearAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags &= ~PD_ALL_FROZEN)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..a410553
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,29 @@
+--
+-- Visibility map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+\set VERBOSITY terse
+-- All pages become all-visible
+VACUUM vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- All pages become all-frozen
+VACUUM FREEZE vmtest;
+SELECT relallfrozen = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- All pages are skipped according to VM
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: Skipped 45 frozen pages acoording to visibility map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 6fc5d1e..a5ff786 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -108,5 +108,8 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare without_oid c
# event triggers cannot run concurrently with any test that runs DDL
test: event_trigger
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
+
# run stats by itself because its delay may be insufficient under heavy load
test: stats
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 2ae51cf..d386d67 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -158,3 +158,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
\ No newline at end of file
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..9bf9094
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,20 @@
+--
+-- Visibility map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+\set VERBOSITY terse
+
+-- All pages become all-visible
+VACUUM vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- All pages become all-frozen
+VACUUM FREEZE vmtest;
+SELECT relallfrozen = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- All pages are skipped according to VM
+VACUUM FREEZE VERBOSE vmtest;
+
+DROP TABLE vmtest;
Masahiko Sawada wrote:
@@ -2972,10 +2981,15 @@ l1:
*/
PageSetPrunable(page, xid);
+ /* clear PD_ALL_VISIBLE and PD_ALL_FORZEN flags */
Typo "FORZEN".
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
+ PageClearAllFrozen(page);
I wonder if it makes sense to have a macro to clear both in unison,
which seems a very common pattern.
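Something along these lines, next to the existing PageClearAllVisible in
bufpage.h, would do (name and placement are just a sketch):

#define PageClearAllVisibleFrozen(page) \
	(((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))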
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..a284b85 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,33 +21,45 @@
 *
 * NOTES
 *
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if whole table scanning vacuum is required (e.g. anti-wraparound vacuum).
+ * A all-frozen bit must be set only when the page is already all-visible.
+ * That is, all-frozen bit is always set with all-visible bit.
"A all-frozen" -> "The all-frozen" (but "A set all-xyz" is correct).
 * When we *set* a visibility map during VACUUM, we must write WAL. This may
 * seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or frozen
+ * to all transactions; we just don't know that for certain. The difficulty is
+ * that there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit. If a
+ * crash occurs after the visibility map page makes it to disk and before the
+ * updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
In the "The difficulty ..." para, I would add the word "corresponding" before
"visibility". Otherwise, it is not clear what the plural means exactly.
 * VACUUM will normally skip pages for which the visibility map bit is set;
 * such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
+ * The visibility map is not used for anti-wraparound vacuums before 9.5, because
 * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
 * present in the table, even on pages that don't have any dead tuples.
+ * 9.6 or later, the visibility map has a additional bit which indicates all tuple
+ * on single page has been completely forzen, so the visibility map is also used for
+ * anti-wraparound vacuums.
This should not mention database versions. Just explain how the code
behaves today, not how it behaved in the past. Those who want to
understand how it behaved in 9.5 can read the 9.5 code. (Again typo
"forzen".)
@@ -1115,6 +1187,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
tups_vacuumed, vacuumed_pages)));

/*
+ * This information would be effective for how much effect all-frozen bit
+ * of VM had for freezing tuples.
+ */
+ ereport(elevel,
+ (errmsg("Skipped %d frozen pages acoording to visibility map",
+ vacrelstats->vmskipped_frozen_pages)));
Message must start on lowercase letter. I don't understand what the
comment means. Can you rephrase it?
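Something like the following, say, would fix the capitalization (and the
"acoording" typo); %u also matches the BlockNumber counter better than %d:

	ereport(elevel,
			(errmsg("skipped %u frozen pages according to visibility map",
					vacrelstats->vmskipped_frozen_pages)));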
@@ -1779,10 +1873,12 @@ vac_cmp_itemptr(const void *left, const void *right)
 /*
  * Check if every tuple in the given page is visible to all current and future
  * transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and all_forzen which implies that all tuples
+ * of this page are frozen.
Typo "forzen" here again.
@@ -201,6 +239,110 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif

+/*
+ * rewriteVisibilitymap()
+ *
+ * A additional bit which indicates that all tuples on page is completely
+ * frozen is added into visibility map at PG 9.6. So the format of visibiilty
+ * map has been changed.
+ * Copies a visibility map file while adding all-frozen bit(0) into each bit.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+#define REWRITE_BUF_SIZE (50 * BLCKSZ)
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd, dst_fd;
+ uint16 vm_bits;
+ ssize_t nbytes;
+ char *buffer;
+ int ret = 0;
+ int save_errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ {
+ errno = EINVAL;
+ return getErrorText(errno);
+ }
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ return getErrorText(errno);
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ {
+ save_errno = errno;
+ if (src_fd != 0)
+ close(src_fd);
+
+ errno = save_errno;
+ return getErrorText(errno);
+ }
+
+ buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
+
+ /* Copy page header data in advance */
+ if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
+ {
+ save_errno = errno;
+ return getErrorText(errno);
+ }
Not clear why you bother with save_errno in this path. Forgot to
close()? (Though I wonder why you bother to close() if the program is
going to exit shortly thereafter anyway.)
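For illustration, nothing modifies errno between the failed read() and the
error report here, so the path could simply be:

	/* Copy page header data in advance */
	if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
		return getErrorText(errno);

with the close() calls left out, since the caller reports the error with
pg_fatal() and exits anyway.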
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 13aa891..fc92a5f 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -112,6 +112,11 @@ extern char *output_files[];
 #define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of visibility map changed with this 9.6 commit,
+ *
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201509181
Useless empty line in comment.
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 66dfef1..52ff14e 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -30,6 +30,9 @@
 * If you add a new entry, remember to update the errhint in
 * forkname_to_number() below, and update the SGML documentation for
 * pg_relation_size().
+ * 9.6 or later, the visibility map fork name is changed from "vm" to
+ * "vfm" bacause visibility map has not only information about all-visible
+ * but also information about all-frozen.
 */
 const char *const forkNames[] = {
 "main", /* MAIN_FORKNUM */
Drop the change in comment? There's no "vfm" in this version of the
patch, is there?
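One more side note on the pg_upgrade rewrite: as far as I can tell, each
rewrite_vm_table entry just spreads old all-visible bit i to bit 2*i of the
new pair of bytes, leaving the new all-frozen bits zero. A tiny sketch of how
the entries could be derived (hypothetical helper, not in the patch):

static uint16
rewrite_vm_byte(uint8 old)
{
	uint16	new_bits = 0;
	int		i;

	/* old bit i (all-visible) becomes new bit 2*i; bit 2*i+1 (all-frozen) stays 0 */
	for (i = 0; i < 8; i++)
		if (old & (1 << i))
			new_bits |= 1 << (2 * i);

	return new_bits;
}

A comment to that effect above the table might help future readers.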
--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sat, Oct 3, 2015 at 12:23 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Masahiko Sawada wrote:
Thank you for taking the time to review this feature.
Attached is the latest version of the patch (v15).
@@ -2972,10 +2981,15 @@ l1:
*/
PageSetPrunable(page, xid);
+ /* clear PD_ALL_VISIBLE and PD_ALL_FORZEN flags */
Typo "FORZEN".
Fixed.
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
+
+ /* all-frozen information is also cleared at the same time */
PageClearAllVisible(page);
+ PageClearAllFrozen(page);

I wonder if it makes sense to have a macro to clear both in unison,
which seems a very common pattern.
Fixed.
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..a284b85 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,33 +21,45 @@
 *
 * NOTES
 *
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if whole table scanning vacuum is required (e.g. anti-wraparound vacuum).
+ * A all-frozen bit must be set only when the page is already all-visible.
+ * That is, all-frozen bit is always set with all-visible bit.

"A all-frozen" -> "The all-frozen" (but "A set all-xyz" is correct).
Fixed.
 * When we *set* a visibility map during VACUUM, we must write WAL. This may
 * seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or frozen
+ * to all transactions; we just don't know that for certain. The difficulty is
+ * that there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit. If a
+ * crash occurs after the visibility map page makes it to disk and before the
+ * updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.

In the "The difficulty ..." para, I would add the word "corresponding" before
"visibility". Otherwise, it is not clear what the plural means exactly.
Fixed.
 * VACUUM will normally skip pages for which the visibility map bit is set;
 * such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
+ * The visibility map is not used for anti-wraparound vacuums before 9.5, because
 * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
 * present in the table, even on pages that don't have any dead tuples.
+ * 9.6 or later, the visibility map has a additional bit which indicates all tuple
+ * on single page has been completely forzen, so the visibility map is also used for
+ * anti-wraparound vacuums.

This should not mention database versions. Just explain how the code
behaves today, not how it behaved in the past. Those who want to
understand how it behaved in 9.5 can read the 9.5 code. (Again typo
"forzen".)
Changed these comments.
Sorry for making the same typo so frequently.
@@ -1115,6 +1187,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
tups_vacuumed, vacuumed_pages)));

/*
+ * This information would be effective for how much effect all-frozen bit
+ * of VM had for freezing tuples.
+ */
+ ereport(elevel,
+ (errmsg("Skipped %d frozen pages acoording to visibility map",
+ vacrelstats->vmskipped_frozen_pages)));

Message must start on lowercase letter. I don't understand what the
comment means. Can you rephrase it?
Fixed.
@@ -1779,10 +1873,12 @@ vac_cmp_itemptr(const void *left, const void *right)
 /*
  * Check if every tuple in the given page is visible to all current and future
  * transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and all_forzen which implies that all tuples
+ * of this page are frozen.

Typo "forzen" here again.
Fixed.
@@ -201,6 +239,110 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif

+/*
+ * rewriteVisibilitymap()
+ *
+ * A additional bit which indicates that all tuples on page is completely
+ * frozen is added into visibility map at PG 9.6. So the format of visibiilty
+ * map has been changed.
+ * Copies a visibility map file while adding all-frozen bit(0) into each bit.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+#define REWRITE_BUF_SIZE (50 * BLCKSZ)
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd, dst_fd;
+ uint16 vm_bits;
+ ssize_t nbytes;
+ char *buffer;
+ int ret = 0;
+ int save_errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ {
+ errno = EINVAL;
+ return getErrorText(errno);
+ }
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ return getErrorText(errno);
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ {
+ save_errno = errno;
+ if (src_fd != 0)
+ close(src_fd);
+
+ errno = save_errno;
+ return getErrorText(errno);
+ }
+
+ buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
+
+ /* Copy page header data in advance */
+ if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
+ {
+ save_errno = errno;
+ return getErrorText(errno);
+ }

Not clear why you bother with save_errno in this path. Forgot to
close()? (Though I wonder why you bother to close() if the program is
going to exit shortly thereafter anyway.)
Fixed.
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 13aa891..fc92a5f 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -112,6 +112,11 @@ extern char *output_files[];
 #define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of visibility map changed with this 9.6 commit,
+ *
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201509181

Useless empty line in comment.
Fixed.
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 66dfef1..52ff14e 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -30,6 +30,9 @@
 * If you add a new entry, remember to update the errhint in
 * forkname_to_number() below, and update the SGML documentation for
 * pg_relation_size().
+ * 9.6 or later, the visibility map fork name is changed from "vm" to
+ * "vfm" bacause visibility map has not only information about all-visible
+ * but also information about all-frozen.
 */
 const char *const forkNames[] = {
 "main", /* MAIN_FORKNUM */

Drop the change in comment? There's no "vfm" in this version of the
patch, is there?
Fixed.
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v15.patchapplication/octet-stream; name=000_add_frozen_bit_into_visibilitymap_v15.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bcf9871..f7adea6 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2176,8 +2176,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -2192,7 +2193,10 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
- PageClearAllVisible(BufferGetPage(buffer));
+
+ /* all-visible and all-frozen information are cleared at the same time */
+ PageClearAllVisibleFrozen(BufferGetPage(buffer));
+
visibilitymap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
vmbuffer);
@@ -2493,7 +2497,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
- PageClearAllVisible(page);
+
+ /* all-visible and all-frozen information are cleared at the same time */
+ PageClearAllVisibleFrozen(page);
+
visibilitymap_clear(relation,
BufferGetBlockNumber(buffer),
vmbuffer);
@@ -2776,9 +2783,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -2972,10 +2979,14 @@ l1:
*/
PageSetPrunable(page, xid);
+ /* clear PD_ALL_VISIBLE and PD_ALL_FROZEN flags */
if (PageIsAllVisible(page))
{
all_visible_cleared = true;
- PageClearAllVisible(page);
+
+ /* all-visible and all-frozen information are cleared at the same time */
+ PageClearAllVisibleFrozen(page);
+
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
vmbuffer);
}
@@ -3254,7 +3265,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
- if (PageIsAllVisible(page))
+ if (PageIsAllVisible(page) || PageIsAllFrozen(page))
visibilitymap_pin(relation, block, &vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -3850,14 +3861,20 @@ l2:
if (PageIsAllVisible(BufferGetPage(buffer)))
{
all_visible_cleared = true;
- PageClearAllVisible(BufferGetPage(buffer));
+
+ /* all-visible and all-frozen information are cleared at the same time */
+ PageClearAllVisibleFrozen(BufferGetPage(buffer));
+
visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
vmbuffer);
}
if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
{
all_visible_cleared_new = true;
- PageClearAllVisible(BufferGetPage(newbuf));
+
+ /* all-visible and all-frozen information are cleared at the same time */
+ PageClearAllVisibleFrozen(BufferGetPage(newbuf));
+
visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
vmbuffer_new);
}
@@ -6942,7 +6959,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -6952,6 +6969,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7541,7 +7559,12 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7593,7 +7616,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
@@ -7743,7 +7766,7 @@ heap_xlog_delete(XLogReaderState *record)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
- PageClearAllVisible(page);
+ PageClearAllVisibleFrozen(page);
/* Make sure there is no forward chain link in t_ctid */
htup->t_ctid = target_tid;
@@ -7847,7 +7870,7 @@ heap_xlog_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
- PageClearAllVisible(page);
+ PageClearAllVisibleFrozen(page);
MarkBufferDirty(buffer);
}
@@ -7986,7 +8009,7 @@ heap_xlog_multi_insert(XLogReaderState *record)
PageSetLSN(page, lsn);
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
- PageClearAllVisible(page);
+ PageClearAllVisibleFrozen(page);
MarkBufferDirty(buffer);
}
@@ -8114,7 +8137,7 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
PageSetPrunable(page, XLogRecGetXid(record));
if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
- PageClearAllVisible(page);
+ PageClearAllVisibleFrozen(page);
PageSetLSN(page, lsn);
MarkBufferDirty(obuffer);
@@ -8249,7 +8272,7 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
elog(PANIC, "heap_update_redo: failed to add tuple");
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
- PageClearAllVisible(page);
+ PageClearAllVisibleFrozen(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..5242325 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,33 +21,41 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * when a whole-table-scanning vacuum is required (e.g. anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible;
+ * that is, the all-frozen bit is always set together with the all-visible bit.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. The all-frozen bit must be
+ * cleared together with the all-visible bit.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or frozen
+ * to all transactions; we just don't know that for certain. The difficulty is
+ * that there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility map
+ * bit. If a crash occurs after the visibility map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
+ * The visibility map has an all-frozen bit which indicates that all tuples on
+ * the corresponding page have been completely frozen, so the visibility map is
+ * also used for anti-wraparound vacuums, even when freezing tuples is required.
*
* LOCKING
*
@@ -58,14 +66,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +109,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +126,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set bits for the all-visible and all-frozen flags */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +170,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,7 +182,8 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) <<
+ (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
@@ -225,7 +255,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +264,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and the flags indicating which bits to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +276,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,7 +286,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -272,11 +304,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +317,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +327,15 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(PageIsAllVisible(heapPage) || PageIsAllFrozen(heapPage));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +346,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test if bit(s) is set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must pass
+ * flags indicating which bit(s) to test.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -328,7 +365,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +374,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s %d %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +397,12 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single or double bit read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ?
+ true : false;
return result;
}
@@ -374,10 +412,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * The caller must pass flags indicating which bit(s) to count.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, uint8 flags)
{
BlockNumber result = 0;
BlockNumber mapBlock;
@@ -406,7 +445,10 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ if (flags & VISIBILITYMAP_ALL_VISIBLE)
+ result += number_of_ones_for_visible[map[i]];
+ if (flags & VISIBILITYMAP_ALL_FROZEN)
+ result += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
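
For reference, the arithmetic behind the new layout: with BITS_PER_HEAPBLOCK = 2
and HEAPBLOCKS_PER_BYTE = 4, heap block N maps to map bit N % 4, its all-visible
flag lives at bit position 2 * (N % 4) of the map byte and the all-frozen flag at
the adjacent position; the two lookup tables above are just per-flag popcounts of
a map byte (and one VM page now covers half as many heap blocks as before). A
minimal standalone sketch of that addressing, in plain C without any server
headers; vm_page_count() is a hypothetical helper used only for illustration:

#include <stdio.h>
#include <stdint.h>

#define ALL_VISIBLE          0x01   /* mirrors VISIBILITYMAP_ALL_VISIBLE */
#define ALL_FROZEN           0x02   /* mirrors VISIBILITYMAP_ALL_FROZEN */
#define BITS_PER_HEAPBLOCK   2
#define HEAPBLOCKS_PER_BYTE  4

/* Count heap blocks whose requested flag is set within one map byte. */
static int
vm_page_count(uint8_t mapbyte, uint8_t flag)
{
    int     n = 0;
    int     blk;

    for (blk = 0; blk < HEAPBLOCKS_PER_BYTE; blk++)
        if (mapbyte & (flag << (BITS_PER_HEAPBLOCK * blk)))
            n++;
    return n;
}

int
main(void)
{
    uint32_t    heapblk = 1000;                       /* arbitrary heap block */
    int         mapbit = heapblk % HEAPBLOCKS_PER_BYTE;
    uint8_t     mapbyte = 0;

    /* set both flags for this block, as visibilitymap_set() would */
    mapbyte |= (ALL_VISIBLE | ALL_FROZEN) << (BITS_PER_HEAPBLOCK * mapbit);

    printf("map bit %d, map byte now 0x%02x\n", mapbit, mapbyte);
    printf("all-visible blocks in this byte: %d\n",
           vm_page_count(mapbyte, ALL_VISIBLE));
    printf("all-frozen blocks in this byte:  %d\n",
           vm_page_count(mapbyte, ALL_FROZEN));
    return 0;
}
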
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..10f8dc9 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ {
+ relallvisible = visibilitymap_count(rel, VISIBILITYMAP_ALL_VISIBLE);
+ relallfrozen = visibilitymap_count(rel, VISIBILITYMAP_ALL_FROZEN);
+ }
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
if (rd_rel->relpages != (int32) relpages)
{
@@ -1940,6 +1947,11 @@ index_update_stats(Relation rel,
rd_rel->relallvisible = (int32) relallvisible;
dirty = true;
}
+ if (rd_rel->relallfrozen != (int32) relallfrozen)
+ {
+ rd_rel->relallfrozen = (int32) relallfrozen;
+ dirty = true;
+ }
}
/*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ddb68ab..ee13f41 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -572,7 +572,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN),
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -595,6 +596,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 6d55148..e5df123 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -747,6 +747,7 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
bool in_outer_xact)
@@ -784,6 +785,11 @@ vac_update_relstats(Relation relation,
pgcform->relallvisible = (int32) num_all_visible_pages;
dirty = true;
}
+ if (pgcform->relallfrozen != (int32) num_all_frozen_pages)
+ {
+ pgcform->relallfrozen = (int32) num_all_frozen_pages;
+ dirty = true;
+ }
/* Apply DDL updates, but not inside an outer transaction (see above) */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..4407b14 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped by all-frozen bit
+ of visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we may still be able to skip scanning some
+ * pages according to the all-frozen bit of the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +258,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -301,10 +307,14 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ new_rel_allfrozen = visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN);
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -312,6 +322,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_pages,
new_rel_tuples,
new_rel_allvisible,
+ new_rel_allfrozen,
vacrelstats->hasindex,
new_frozen_xid,
new_min_multi,
@@ -360,10 +371,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +498,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * the visibility map means that we can't update relfrozenxid, so we only
+ * want to do it if we can skip a goodly number. On the other hand, we
+ * count both how many pages we skipped according to the all-frozen bit of
+ * the visibility map and how many pages we actually scanned, so we can
+ * update relfrozenxid if the sum of them covers all pages of the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -498,7 +513,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we might not actually skip any pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
@@ -515,7 +530,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
break;
vacuum_delay_point();
}
@@ -533,9 +548,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples frozen on this page */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -548,7 +567,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE))
break;
vacuum_delay_point();
}
@@ -563,13 +583,32 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
skipping_all_visible_blocks = false;
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to visibility map.
+ * We check whether this block is all-frozen so that we can skip
+ * vacuuming it even when scanning the whole table is required.
+ */
+ bool all_frozen = visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN);
+ if (scan_all)
+ {
+ if (all_frozen)
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ }
+ else
+ {
+ if (skipping_all_visible_blocks)
+ continue;
+ }
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -740,7 +779,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +804,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +960,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +978,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
+ * We must log the changes to be crash-safe against future truncation
+ * of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1014,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1039,46 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,7 +1089,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
@@ -1042,7 +1113,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
{
elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
relname, blkno);
- PageClearAllVisible(page);
+ PageClearAllVisibleFrozen(page);
MarkBufferDirty(buf);
visibilitymap_clear(onerel, blkno, vmbuffer);
}
@@ -1078,7 +1149,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map and frozen map page.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1114,6 +1185,11 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg("skipped %d frozen pages acoording to visibility map",
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1226,6 +1302,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1277,19 +1354,31 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set VM all-frozen bit and flag.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
- Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ if (!visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* mark page all-frozen, and set VM all-frozen bit */
+ if (all_frozen)
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1408,6 +1497,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_pages,
stats->num_index_tuples,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
@@ -1779,10 +1869,12 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and all_frozen which implies that all tuples
+ * of this page are frozen.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1791,6 +1883,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1814,11 +1907,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible or frozen.
*/
if (ItemIdIsDead(itemid))
{
all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1855,6 +1949,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1863,6 +1961,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 41d4606..c34b5da 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -231,6 +231,15 @@ check_cluster_versions(void)
if (old_cluster.major_version > new_cluster.major_version)
pg_fatal("This utility cannot be used to downgrade to older major PostgreSQL versions.\n");
+ /*
+ * We can't allow upgrading with link mode from 9.5 or before to 9.6 or later,
+ * because the format of the visibility map changed in version 9.6.
+ */
+ if (user_opts.transfer_mode == TRANSFER_MODE_LINK &&
+ GET_MAJOR_VERSION(old_cluster.major_version) < 906 &&
+ GET_MAJOR_VERSION(new_cluster.major_version) >= 906)
+ pg_fatal("This utility cannot upgrade from PostgreSQL version from 9.5 or before to 9.6 or later with link mode.\n");
+
/* get old and new binary versions */
get_bin_version(&old_cluster);
get_bin_version(&new_cluster);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 79d9390..9bae08c 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
#include <fcntl.h>
@@ -21,6 +22,43 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static const char *rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool force);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
+
+/*
+ * copyOrRewriteFile()
+ * This function copies file or rewrite visibility map file.
+ * If rewrite_vm is true, we have to rewrite visibility map regardless value of pageConverter.
+ */
+const char *
+copyOrRewriteFile(pageCnvCtx *pageConverter,
+ const char *src, const char *dst, bool force, bool rewrite_vm)
+{
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, force);
+ else
+ return copyAndUpdateFile(pageConverter, src, dst, force);
+}
/*
* copyAndUpdateFile()
@@ -201,6 +239,99 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilitymap()
+ *
+ * An additional bit, indicating that all tuples on the page are completely
+ * frozen, was added to the visibility map in PG 9.6, so the format of the
+ * visibility map has changed.
+ * Copies a visibility map file while adding an all-frozen bit (initially 0)
+ * next to each existing bit.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+#define REWRITE_BUF_SIZE (50 * BLCKSZ)
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd, dst_fd;
+ uint16 vm_bits;
+ ssize_t nbytes;
+ char *buffer;
+ int ret = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ {
+ errno = EINVAL;
+ goto err;
+ }
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
+
+ /* Copy page header data in advance */
+ if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
+ goto err;
+
+ if (write(dst_fd, buffer, nbytes) != nbytes)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ /* perform data rewriting, i.e. read from source, write to destination */
+ while (true)
+ {
+ ssize_t nbytes = read(src_fd, buffer, REWRITE_BUF_SIZE);
+ char *cur, *end;
+
+ if (nbytes < 0)
+ {
+ ret = -1;
+ break;
+ }
+
+ if (nbytes == 0)
+ break;
+
+ cur = buffer;
+ end = buffer + nbytes;
+
+ /* Rewrite each source byte and write BITS_PER_HEAPBLOCK bytes to dst_fd */
+ while (end > cur)
+ {
+ /* Get the rewritten bits from the table */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+
+ if (write(dst_fd, &vm_bits, BITS_PER_HEAPBLOCK) != BITS_PER_HEAPBLOCK)
+ {
+ ret = -1;
+ break;
+ }
+ cur++;
+ }
+ }
+
+err:
+
+ if (buffer)
+ pg_free(buffer);
+
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText(errno);
+}
+
void
check_hard_link(void)
{
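
The rewrite_vm_table above falls out of the same encoding: every old visibility
map byte carries eight all-visible bits, and the rewritten 16-bit value places
old bit i at position 2 * i, leaving the interleaved all-frozen bits zero. A
small standalone sketch of how such entries can be derived and spot-checked;
make_vm_entry() is an illustrative helper, not a function in pg_upgrade:

#include <stdio.h>
#include <stdint.h>

/* Spread the eight all-visible bits of an old map byte into even positions. */
static uint16_t
make_vm_entry(uint8_t oldbyte)
{
    uint16_t    newval = 0;
    int         i;

    for (i = 0; i < 8; i++)
        if (oldbyte & (1 << i))
            newval |= (uint16_t) 1 << (2 * i);
    return newval;
}

int
main(void)
{
    /* spot-check entries from the table: 0x03 -> 5, 0x0f -> 85, 0xff -> 21845 */
    printf("%u %u %u\n",
           (unsigned) make_vm_entry(0x03),
           (unsigned) make_vm_entry(0x0f),
           (unsigned) make_vm_entry(0xff));
    return 0;
}
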
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 13aa891..024f7af 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -112,6 +112,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of visibility map changed with this 9.6 commit,
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201509181
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -396,6 +400,8 @@ const pageCnvCtx *setupPageConverter(void);
typedef void *pageCnvCtx;
#endif
+const char *copyOrRewriteFile(pageCnvCtx *pageConverter, const char *src,
+ const char *dst, bool force, bool rewrite_vm);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c22df42..2fa5b47 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_rewrite_needed = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,6 +181,13 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_rewrite_needed = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
@@ -195,7 +203,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
* Copy/link any fsm and vm files, if they exist
*/
transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
- if (vm_crashsafe_match)
+ if (vm_crashsafe_match || vm_rewrite_needed)
transfer_relfile(pageConverter, &maps[mapnum], "_vm");
}
}
@@ -218,6 +226,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
int fd;
int segno;
char extent_suffix[65];
+ bool rewrite_vm = false;
/*
* Now copy/link any related segments as well. Remember, PG breaks large
@@ -276,7 +285,15 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (strcmp(type_suffix, "_vm") == 0 &&
+ old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ rewrite_vm = true;
+
+ if ((msg = copyOrRewriteFile(pageConverter, old_file, new_file, true, rewrite_vm)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index aa7f399..6b058d4 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..93afb10 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..7270609 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,20 @@
#include "storage/buf.h"
#include "utils/relcache.h"
-extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
+extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern BlockNumber visibilitymap_count(Relation rel, uint8 flags);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 9730561..45b117c 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201509161
+#define CATALOG_VERSION_NO 201509181
#endif
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 25247b5..e64a1c8 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -95,7 +97,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 30
+#define Natts_pg_class 31
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -107,25 +109,26 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relispopulated 25
-#define Anum_pg_class_relreplident 26
-#define Anum_pg_class_relfrozenxid 27
-#define Anum_pg_class_relminmxid 28
-#define Anum_pg_class_relacl 29
-#define Anum_pg_class_reloptions 30
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relispopulated 26
+#define Anum_pg_class_relreplident 27
+#define Anum_pg_class_relfrozenxid 28
+#define Anum_pg_class_relminmxid 29
+#define Anum_pg_class_relacl 30
+#define Anum_pg_class_reloptions 31
/* ----------------
* initial contents of pg_class
@@ -140,13 +143,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 29 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 29 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e3a31af..d2bae2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -172,6 +172,7 @@ extern void vac_update_relstats(Relation relation,
BlockNumber num_pages,
double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..3de3737 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ * frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -369,6 +371,15 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+#define PageClearAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags &= ~PD_ALL_FROZEN)
+#define PageClearAllVisibleFrozen(page) \
+ (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..a410553
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,29 @@
+--
+-- Visibility map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+\set VERBOSITY terse
+-- All pages become all-visible
+VACUUM vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- All pages become all-frozen
+VACUUM FREEZE vmtest;
+SELECT relallfrozen = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- All pages are skipped according to VM
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 45 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 6fc5d1e..a5ff786 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -108,5 +108,8 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare without_oid c
# event triggers cannot run concurrently with any test that runs DDL
test: event_trigger
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
+
# run stats by itself because its delay may be insufficient under heavy load
test: stats
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 2ae51cf..d386d67 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -158,3 +158,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
\ No newline at end of file
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..9bf9094
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,20 @@
+--
+-- Visibility map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+\set VERBOSITY terse
+
+-- All pages become all-visible
+VACUUM vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- All pages become all-frozen
+VACUUM FREEZE vmtest;
+SELECT relallfrozen = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- All pages are skipped according to VM
+VACUUM FREEZE VERBOSE vmtest;
+
+DROP TABLE vmtest;
On Fri, Oct 2, 2015 at 11:23 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> + /* all-frozen information is also cleared at the same time */
>   PageClearAllVisible(page);
> + PageClearAllFrozen(page);
>
> I wonder if it makes sense to have a macro to clear both in unison,
> which seems a very common pattern.

I think PageClearAllVisible should clear both, and there should be no
other macro. There is no event that causes a page to cease being
all-visible that does not also cause it to cease being all-frozen.
You might think that deleting or locking a tuple would fall into that
category - but nope, XMAX needs to be cleared or the tuple pruned, or
there will be problems after wraparound.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sat, Oct 3, 2015 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Oct 2, 2015 at 11:23 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> + /* all-frozen information is also cleared at the same time */
>>   PageClearAllVisible(page);
>> + PageClearAllFrozen(page);
>>
>> I wonder if it makes sense to have a macro to clear both in unison,
>> which seems a very common pattern.
>
> I think PageClearAllVisible should clear both, and there should be no
> other macro. There is no event that causes a page to cease being
> all-visible that does not also cause it to cease being all-frozen.
> You might think that deleting or locking a tuple would fall into that
> category - but nope, XMAX needs to be cleared or the tuple pruned, or
> there will be problems after wraparound.

Thank you for your advice. Understood.
I changed the patch so that PageClearAllVisible clears both bits, and
removed PageClearAllFrozen.
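For reference, a minimal sketch of what the combined macro looks like after
this change (assuming the PD_ALL_FROZEN page flag added by this patch):

    #define PageClearAllVisible(page) \
        (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
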
Attached is the latest v16 patch, which also contains a draft documentation patch.
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v16.patchtext/x-patch; charset=US-ASCII; name=000_add_frozen_bit_into_visibilitymap_v16.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 97ef618..f8aa18b 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1841,6 +1841,18 @@
</row>
<row>
+ <entry><structfield>relallfrozen</structfield></entry>
+ <entry><type>int4</type></entry>
+ <entry></entry>
+ <entry>
+ Number of pages that are marked all-frozen in the table's
+ visibility map. It is updated by <command>VACUUM</command>,
+ <command>ANALYZE</command>, and a few DDL commands such as
+ <command>CREATE INDEX</command>.
+ </entry>
+ </row>
+
+ <row>
<entry><structfield>reltoastrelid</structfield></entry>
<entry><type>oid</type></entry>
<entry><literal><link linkend="catalog-pg-class"><structname>pg_class</structname></link>.oid</literal></entry>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5081da0..6bd4d57 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5900,7 +5900,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs aggressive freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -5944,7 +5944,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs aggressive freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index b5d4050..c8ad27f 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and which pages contain only tuples that are marked as
+ frozen. This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -438,23 +438,22 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows.
+ To ensure all old row versions have been frozen, a scan of all pages that
+ are not marked as frozen is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: a table sweep is forced if
+ the table hasn't had all of its row versions confirmed frozen for
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transactions.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
+ the time <command>VACUUM</> last scanned pages that are not marked as frozen.
+ If it were to go unvacuumed for longer than
that, data loss could result. To ensure that this does not happen,
autovacuum is invoked on any table that might contain unfrozen rows with
XIDs older than the age specified by the configuration parameter <xref
@@ -490,8 +489,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +525,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +553,19 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. The whole table is scanned only when all pages happen to require
+ vacuuming to remove dead row versions. In other cases, such as when
+ <structfield>relfrozenxid</> is more than <varname>vacuum_freeze_table_age</>
+ transactions old or when <command>VACUUM</>'s <literal>FREEZE</> option is used,
+ <command>VACUUM</> can skip pages on which all tuples are already marked as
+ frozen.
+ When all pages of the table have eventually been marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transactions started since the <command>VACUUM</> started).
+ If <structfield>relfrozenxid</> has not been advanced by the time
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,28 +642,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ Whenever <command>VACUUM</> scans all pages that are not marked as frozen,
+ regardless of what causes it to do so, the value for that table can be advanced.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Such
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +743,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..9328cdf 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -623,18 +623,21 @@ can be used to examine the information stored in free space maps.
<para>
Each heap relation has a Visibility Map
(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
+visible to all active transactions, and which pages contain only tuples that
+are marked as frozen.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_vm</> suffix.
+For example, if the filenode of a relation is 12345, the VM is stored in a file
+called <filename>12345_vm</>, in the same directory as the main relation file.
Note that indexes do not have VMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The visibility map stores two bits per heap page: all-visible and all-frozen.
+A set all-visible bit means that all tuples on the page are known to be visible
+to all transactions. A set all-frozen bit means that all tuples on the page are
+completely marked as frozen. This means that the page does not contain any tuples
+that need to be vacuumed, even when a whole-table vacuum is required.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bcf9871..84577b4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2176,8 +2176,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -2776,9 +2777,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -6942,7 +6943,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -6952,6 +6953,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7541,7 +7543,12 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7593,7 +7600,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..b068fbe 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,33 +21,41 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if a whole-table scanning vacuum is required (e.g. anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible.
+ * That is, the all-frozen bit is always set together with the all-visible bit.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. The all-frozen bit must be
+ * cleared together with the all-visible bit.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or frozen
+ * to all transactions; we just don't know that for certain. The difficulty is
+ * that there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility map
+ * bit. If a crash occurs after the visibility map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
+ * The visibility map has an all-frozen bit, which indicates that all tuples on
+ * the corresponding page have been completely frozen, so the visibility map is
+ * also used for anti-wraparound vacuums, even though freezing tuples is required.
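+ *
+ * For example, with the macros defined below, heap block N is tracked by
+ * byte N / HEAPBLOCKS_PER_BYTE of the map; within that byte, bit
+ * BITS_PER_HEAPBLOCK * (N % HEAPBLOCKS_PER_BYTE) is its all-visible bit and
+ * the bit just above it is its all-frozen bit (assuming
+ * VISIBILITYMAP_ALL_VISIBLE is the low bit of each pair, as the counting
+ * tables below imply).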
*
* LOCKING
*
@@ -58,14 +66,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +109,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One bit for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +126,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set bits for all-visible and all-frozen */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +170,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits for one page in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,11 +182,12 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) <<
+ (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s block %d", RelationGetRelationName(rel), heapBlk);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -225,7 +255,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +264,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and the flags indicating which bits we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +276,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,7 +286,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -272,11 +304,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +317,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +327,15 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(PageIsAllVisible(heapPage) || PageIsAllFrozen(heapPage));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +346,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test whether the given bit(s) are set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits. The caller must
+ * pass flags indicating which bit(s) to test.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -328,7 +365,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +374,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +397,12 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single or double bit read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ?
+ true : false;
return result;
}
@@ -374,10 +412,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * The caller must pass flags indicating which bit(s) to count.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, uint8 flags)
{
BlockNumber result = 0;
BlockNumber mapBlock;
@@ -406,7 +445,10 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ if (flags & VISIBILITYMAP_ALL_VISIBLE)
+ result += number_of_ones_for_visible[map[i]];
+ if (flags & VISIBILITYMAP_ALL_FROZEN)
+ result += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
@@ -435,7 +477,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ elog(DEBUG1, "vm_truncate %s block %d", RelationGetRelationName(rel), nheapblocks);
#endif
RelationOpenSmgr(rel);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..10f8dc9 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ {
+ relallvisible = visibilitymap_count(rel, VISIBILITYMAP_ALL_VISIBLE);
+ relallfrozen = visibilitymap_count(rel, VISIBILITYMAP_ALL_FROZEN);
+ }
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
if (rd_rel->relpages != (int32) relpages)
{
@@ -1940,6 +1947,11 @@ index_update_stats(Relation rel,
rd_rel->relallvisible = (int32) relallvisible;
dirty = true;
}
+ if (rd_rel->relallfrozen != (int32) relallfrozen)
+ {
+ rd_rel->relallfrozen = (int32) relallfrozen;
+ dirty = true;
+ }
}
/*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ddb68ab..ee13f41 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -572,7 +572,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE),
+ visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN),
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -595,6 +596,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 6d55148..e5df123 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -747,6 +747,7 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
bool in_outer_xact)
@@ -784,6 +785,11 @@ vac_update_relstats(Relation relation,
pgcform->relallvisible = (int32) num_all_visible_pages;
dirty = true;
}
+ if (pgcform->relallfrozen != (int32) num_all_frozen_pages)
+ {
+ pgcform->relallfrozen = (int32) num_all_frozen_pages;
+ dirty = true;
+ }
/* Apply DDL updates, but not inside an outer transaction (see above) */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..b8f7d30 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped by all-frozen bit
+ * of visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we can still skip scanning pages that are
+ * marked all-frozen in the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +258,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -301,10 +307,14 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = visibilitymap_count(onerel, VISIBILITYMAP_ALL_VISIBLE);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ new_rel_allfrozen = visibilitymap_count(onerel, VISIBILITYMAP_ALL_FROZEN);
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -312,6 +322,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_pages,
new_rel_tuples,
new_rel_allvisible,
+ new_rel_allfrozen,
vacrelstats->hasindex,
new_frozen_xid,
new_min_multi,
@@ -360,10 +371,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +498,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of the
+ * visibility map means that we can't update relfrozenxid, so we only want
+ * to do it if we can skip a goodly number of pages. On the other hand, we
+ * count both how many pages we skipped according to the all-frozen bit of
+ * the visibility map and how many pages we froze, so we can update
+ * relfrozenxid if their sum equals the number of pages in the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -498,7 +513,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we might not actually skip any pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
@@ -515,7 +530,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
break;
vacuum_delay_point();
}
@@ -533,9 +548,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples we froze on this page */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -548,7 +567,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE))
break;
vacuum_delay_point();
}
@@ -563,13 +583,32 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
skipping_all_visible_blocks = false;
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to visibility map.
+ * We check whether this block is all-frozen so that we can skip
+ * vacuuming this page even when a whole-table scan is required.
+ */
+ bool all_frozen = visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN);
+ if (scan_all)
+ {
+ if (all_frozen)
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ }
+ else
+ {
+ if (skipping_all_visible_blocks)
+ continue;
+ }
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -740,7 +779,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +804,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +960,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +978,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
+ * We must log the changes to be crash-safe against future truncation
+ * of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1014,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1039,46 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,7 +1089,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
@@ -1078,7 +1149,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map and frozen map page.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1114,6 +1185,11 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg("skipped %u frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1226,6 +1302,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1277,19 +1354,31 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set the page-level flag and the VM all-frozen bit.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
- Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ if (!visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* mark page all-frozen, and set VM all-frozen bit */
+ if (all_frozen)
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1408,6 +1497,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_pages,
stats->num_index_tuples,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
@@ -1779,10 +1869,12 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and set all_frozen to indicate whether all
+ * tuples of this page are frozen.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1791,6 +1883,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1814,11 +1907,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible or frozen.
*/
if (ItemIdIsDead(itemid))
{
all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1855,6 +1949,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1863,6 +1961,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 41d4606..c34b5da 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -231,6 +231,15 @@ check_cluster_versions(void)
if (old_cluster.major_version > new_cluster.major_version)
pg_fatal("This utility cannot be used to downgrade to older major PostgreSQL versions.\n");
+ /*
+ * We can't allow upgrading in link mode from 9.5 or before to 9.6 or later,
+ * because the format of the visibility map changed in version 9.6.
+ */
+ if (user_opts.transfer_mode == TRANSFER_MODE_LINK &&
+ GET_MAJOR_VERSION(old_cluster.major_version) < 906 &&
+ GET_MAJOR_VERSION(new_cluster.major_version) >= 906)
+ pg_fatal("This utility cannot upgrade from PostgreSQL version from 9.5 or before to 9.6 or later with link mode.\n");
+
/* get old and new binary versions */
get_bin_version(&old_cluster);
get_bin_version(&new_cluster);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 79d9390..9bae08c 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
#include <fcntl.h>
@@ -21,6 +22,43 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static const char *rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool force);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
+
+/*
+ * copyOrRewriteFile()
+ * This function copies a file, or rewrites a visibility map file.
+ * If rewrite_vm is true, we rewrite the visibility map regardless of the value of pageConverter.
+ */
+const char *
+copyOrRewriteFile(pageCnvCtx *pageConverter,
+ const char *src, const char *dst, bool force, bool rewrite_vm)
+{
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, force);
+ else
+ return copyAndUpdateFile(pageConverter, src, dst, force);
+}
/*
* copyAndUpdateFile()
@@ -201,6 +239,99 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilitymap()
+ *
+ * An additional bit indicating that all tuples on the page are completely
+ * frozen was added to the visibility map in PG 9.6, so the format of the
+ * visibility map has changed.
+ * Copies a visibility map file while adding an all-frozen bit (initially 0)
+ * for each heap block.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+#define REWRITE_BUF_SIZE (50 * BLCKSZ)
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd, dst_fd;
+ uint16 vm_bits;
+ ssize_t nbytes;
+ char *buffer = NULL;
+ int ret = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ {
+ errno = EINVAL;
+ goto err;
+ }
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
+
+ /* Copy page header data in advance */
+ if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
+ goto err;
+
+ if (write(dst_fd, buffer, nbytes) != nbytes)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ /* perform data rewriting, i.e., read from the source, write to the destination */
+ while (true)
+ {
+ ssize_t nbytes = read(src_fd, buffer, REWRITE_BUF_SIZE);
+ char *cur, *end;
+
+ if (nbytes < 0)
+ {
+ ret = -1;
+ break;
+ }
+
+ if (nbytes == 0)
+ break;
+
+ cur = buffer;
+ end = buffer + nbytes;
+
+ /* Rewrite each source byte into BITS_PER_HEAPBLOCK bytes and write them to dst_fd */
+ while (end > cur)
+ {
+ /* Get the rewritten bits from the table */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+
+ if (write(dst_fd, &vm_bits, BITS_PER_HEAPBLOCK) != BITS_PER_HEAPBLOCK)
+ {
+ ret = -1;
+ break;
+ }
+ cur++;
+ }
+ }
+
+err:
+
+ if (buffer)
+ pg_free(buffer);
+
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText(errno);
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 13aa891..fc45ef6 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -112,6 +112,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map changed with this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201510051
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -396,6 +400,8 @@ const pageCnvCtx *setupPageConverter(void);
typedef void *pageCnvCtx;
#endif
+const char *copyOrRewriteFile(pageCnvCtx *pageConverter, const char *src,
+ const char *dst, bool force, bool rewrite_vm);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c22df42..2fa5b47 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_rewrite_needed = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,6 +181,13 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_rewrite_needed = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
@@ -195,7 +203,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
* Copy/link any fsm and vm files, if they exist
*/
transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
- if (vm_crashsafe_match)
+ if (vm_crashsafe_match || vm_rewrite_needed)
transfer_relfile(pageConverter, &maps[mapnum], "_vm");
}
}
@@ -218,6 +226,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
int fd;
int segno;
char extent_suffix[65];
+ bool rewrite_vm = false;
/*
* Now copy/link any related segments as well. Remember, PG breaks large
@@ -276,7 +285,15 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (strcmp(type_suffix, "_vm") == 0 &&
+ old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ rewrite_vm = true;
+
+ if ((msg = copyOrRewriteFile(pageConverter, old_file, new_file, true, rewrite_vm)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index aa7f399..6b058d4 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..93afb10 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..7270609 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,20 @@
#include "storage/buf.h"
#include "utils/relcache.h"
-extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
+extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern BlockNumber visibilitymap_count(Relation rel, uint8 flags);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 8afe5cc..e3b567a 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201510042
+#define CATALOG_VERSION_NO 201510051
#endif
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 06d287e..e8c1316 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -96,7 +98,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 31
+#define Natts_pg_class 32
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -108,26 +110,27 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relforcerowsecurity 25
-#define Anum_pg_class_relispopulated 26
-#define Anum_pg_class_relreplident 27
-#define Anum_pg_class_relfrozenxid 28
-#define Anum_pg_class_relminmxid 29
-#define Anum_pg_class_relacl 30
-#define Anum_pg_class_reloptions 31
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relforcerowsecurity 26
+#define Anum_pg_class_relispopulated 27
+#define Anum_pg_class_relreplident 28
+#define Anum_pg_class_relfrozenxid 29
+#define Anum_pg_class_relminmxid 30
+#define Anum_pg_class_relacl 31
+#define Anum_pg_class_reloptions 32
/* ----------------
* initial contents of pg_class
@@ -142,13 +145,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 29 0 t f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 29 0 t f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 32 0 t f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e3a31af..d2bae2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -172,6 +172,7 @@ extern void vac_update_relstats(Relation relation,
BlockNumber num_pages,
double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..1040885 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -369,6 +371,11 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..fe0c60c
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,29 @@
+--
+-- Visibility map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+\set VERBOSITY terse
+-- All pages are become all-visible
+VACUUM vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- All pages are become all-frozen
+VACUUM FREEZE vmtest;
+SELECT relallfrozen = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- All pages are skipped acoording to VM
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 45 frozen pages acoording to visibility map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 6fc5d1e..a5ff786 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -108,5 +108,8 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare without_oid c
# event triggers cannot run concurrently with any test that runs DDL
test: event_trigger
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
+
# run stats by itself because its delay may be insufficient under heavy load
test: stats
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 2ae51cf..d386d67 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -158,3 +158,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
\ No newline at end of file
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..9bf9094
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,20 @@
+--
+-- Visibility map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+\set VERBOSITY terse
+
+-- All pages are become all-visible
+VACUUM vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- All pages are become all-frozen
+VACUUM FREEZE vmtest;
+SELECT relallfrozen = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- All pages are skipped acoording to VM
+VACUUM FREEZE VERBOSE vmtest;
+
+DROP TABLE vmtest;
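For readers skimming the pg_upgrade hunk above: each byte of an old-format visibility map holds the all-visible bits of eight heap blocks, and rewrite_vm_table expands it into a 16-bit value in which old bit i becomes new bit 2*i (the all-visible position of block i) while the interleaved all-frozen bits stay zero. A minimal sketch of one way such an entry could be derived; the helper name is made up for illustration and is not part of the patch:

#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: expand one old-format
 * visibility map byte (one all-visible bit per heap block) into the new
 * two-bits-per-block format, leaving every all-frozen bit cleared.
 */
static uint16_t
expand_vm_byte(uint8_t old_byte)
{
    uint16_t    new_bits = 0;
    int         i;

    for (i = 0; i < 8; i++)
    {
        if (old_byte & (1 << i))
            new_bits |= (uint16_t) 1 << (2 * i);    /* all-visible bit of block i */
    }
    return new_bits;            /* e.g. expand_vm_byte(0xFF) == 0x5555 == 21845 */
}

Every entry of rewrite_vm_table equals this expansion of its index, e.g. entry 3 is 5 and entry 255 is 21845.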
On Fri, Oct 2, 2015 at 8:14 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
+#define Anum_pg_class_relallfrozen 12
Why is pg_class.relallfrozen necessary? ISTM that there is no user of it now.
The relallfrozen value would be useful for users to estimate the time that a
VACUUM FREEZE or an anti-wraparound vacuum will take, before actually running it.
(Also, this value is used in the regression test.)
But this information is not used for planning the way relallvisible is, so it
might be better to move it to another system view such as
pg_stat_*_tables.
Or make pgstattuple and pgstattuple_approx report even the number
of frozen tuples?
Regards,
--
Fujii Masao
On 10 September 2015 at 01:58, Andres Freund <andres@anarazel.de> wrote:
On 2015-09-04 23:35:42 +0100, Simon Riggs wrote:
This looks OK. You saw that I was proposing to solve this problem a
different way ("Summary of plans to avoid the annoyance of Freezing"),
suggesting that we wait for a few CFs to see if a patch emerges for that-
then fall back to this patch if it doesn't? So I am moving this patch to
next CF.
As noted on that other thread I don't think that's a good policy, and it
seems like Robert agrees with me. So I think we should move this back to
"Needs Review".
I also agree. Andres and I spoke at PostgresOpen and he persuaded me; I've
just been away.
Am happy to review and commit in next few days/weeks, once I catch up on
the thread.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Oct 5, 2015 at 11:03 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Oct 2, 2015 at 8:14 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
+#define Anum_pg_class_relallfrozen 12
Why is pg_class.relallfrozen necessary? ISTM that there is no user of it now.
The relallfrozen value would be useful for users to estimate the time that a
VACUUM FREEZE or an anti-wraparound vacuum will take, before actually running it.
(Also, this value is used in the regression test.)
But this information is not used for planning the way relallvisible is, so it
might be better to move it to another system view such as pg_stat_*_tables.
Or make pgstattuple and pgstattuple_approx report even the number
of frozen tuples?
But we cannot know the number of frozen pages without installing the
pageinspect module.
I'm a bit concerned that not all projects can install extension modules
into PostgreSQL in a production environment.
I think we need to provide such a feature in core, at least.
Thoughts?
Regards,
--
Masahiko Sawada
On Mon, Oct 5, 2015 at 7:31 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Oct 3, 2015 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Oct 2, 2015 at 11:23 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
+ /* all-frozen information is also cleared at the same time */ PageClearAllVisible(page); + PageClearAllFrozen(page);
I wonder if it makes sense to have a macro to clear both in unison,
which seems a very common pattern.
I think PageClearAllVisible should clear both, and there should be no
other macro. There is no event that causes a page to cease being
all-visible that does not also cause it to cease being all-frozen.
You might think that deleting or locking a tuple would fall into that
category - but nope, XMAX needs to be cleared or the tuple pruned, or
there will be problems after wraparound.
Thank you for your advice. I understood.
I changed the patch so that PageClearAllVisible clears both bits, and
removed ClearAllFrozen.
Attached the latest v16 patch which contains a draft version of the documentation patch.
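A rough sketch of the agreed-upon change, assuming the PD_ALL_FROZEN page flag added by this patch: the existing macro in bufpage.h simply clears both bits, and no separate clear macro is kept.

/* Sketch only: clearing the all-visible page flag also clears all-frozen. */
#define PageClearAllVisible(page) \
    (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))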
Thanks for updating the patch! Here are some more review comments.
+ ereport(elevel,
+ (errmsg("skipped %d frozen pages acoording to visibility map",
+ vacrelstats->vmskipped_frozen_pages)));
Typo: acoording should be according.
When vmskipped_frozen_pages is 1, "1 frozen pages" in log message
sounds incorrect in terms of grammar. So probably errmsg_plural()
should be used here.
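For reference, a minimal sketch of the errmsg_plural() form being suggested here, reusing the patch's vmskipped_frozen_pages counter:

    ereport(elevel,
            (errmsg_plural("skipped %d frozen page according to visibility map",
                           "skipped %d frozen pages according to visibility map",
                           vacrelstats->vmskipped_frozen_pages,
                           vacrelstats->vmskipped_frozen_pages)));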
+ relallvisible = visibilitymap_count(rel,
VISIBILITYMAP_ALL_VISIBLE);
+ relallfrozen = visibilitymap_count(rel, VISIBILITYMAP_ALL_FROZEN);
We can refactor visibilitymap_count() so that it counts the numbers of
both all-visible and all-frozen pages at the same time, in order to
avoid reading through the visibility map twice.
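A sketch of that refactoring, one pass with two counters; the new two-output signature is only an assumption here, while vm_readbuf(), MAPSIZE and the number_of_ones_for_visible/frozen tables are the ones already introduced by the patch:

void
visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)
{
    BlockNumber mapBlock;

    *all_visible = 0;
    *all_frozen = 0;

    for (mapBlock = 0;; mapBlock++)
    {
        Buffer      mapBuffer;
        unsigned char *map;
        int         i;

        mapBuffer = vm_readbuf(rel, mapBlock, false);
        if (!BufferIsValid(mapBuffer))
            break;              /* all remaining heap pages are unmapped */

        map = (unsigned char *) PageGetContents(BufferGetPage(mapBuffer));

        for (i = 0; i < MAPSIZE; i++)
        {
            *all_visible += number_of_ones_for_visible[map[i]];
            *all_frozen += number_of_ones_for_frozen[map[i]];
        }

        ReleaseBuffer(mapBuffer);
    }
}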
heap_page_is_all_visible() can set all_frozen to TRUE even when
it returns FALSE. This is odd because the page must not be all frozen
when it's not all visible. heap_page_is_all_visible() should set
all_frozen to FALSE whenever all_visible is set to FALSE?
Probably it's better to forcibly set all_frozen to FALSE at the end of
the function whenever all_visible is FALSE.
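In code, that amounts to something like this at the end of the function (sketch only):

    /*
     * Sketch: a page can only be all-frozen if it is also all-visible, so
     * make the two output flags consistent before returning.
     */
    if (!all_visible)
        *all_frozen = false;

    return all_visible;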
+ if (PageIsAllVisible(page))
{
- Assert(BufferIsValid(*vmbuffer));
Why did you remove this assertion?
+ if (all_frozen)
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
Why didn't you call visibilitymap_test() for the all-frozen case here?
In visibilitymap_set(), the argument flag must be either
(VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) or
VISIBILITYMAP_ALL_VISIBLE. So I think that it's better to add
Assert() which checks whether the specified flag is valid or not.
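One way to write such a check, assuming only the two combinations above are ever passed:

    /* Sketch: only these two flag combinations are legal for visibilitymap_set(). */
    Assert(flags == VISIBILITYMAP_ALL_VISIBLE ||
           flags == (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN));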
+ * caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(PageIsAllVisible(heapPage) ||
PageIsAllFrozen(heapPage));
This should be the following?
Assert(((flag | VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
((flag | VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
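Presumably a bitwise AND is what is intended here; with |, both parenthesized flag tests are trivially non-zero. A corrected sketch:

    Assert(((flags & VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
           ((flags & VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));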
Regards,
--
Fujii Masao
On Thu, Oct 8, 2015 at 7:03 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Mon, Oct 5, 2015 at 7:31 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Oct 3, 2015 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Oct 2, 2015 at 11:23 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
+ /* all-frozen information is also cleared at the same time */ PageClearAllVisible(page); + PageClearAllFrozen(page);
I wonder if it makes sense to have a macro to clear both in unison,
which seems a very common pattern.
I think PageClearAllVisible should clear both, and there should be no
other macro. There is no event that causes a page to cease being
all-visible that does not also cause it to cease being all-frozen.
You might think that deleting or locking a tuple would fall into that
category - but nope, XMAX needs to be cleared or the tuple pruned, or
there will be problems after wraparound.
Thank you for your advice. I understood.
I changed the patch so that PageClearAllVisible clears both bits, and
removed ClearAllFrozen.
Attached the latest v16 patch which contains a draft version of the documentation patch.
Thanks for updating the patch! Here are some more review comments.
Thank you for reviewing!
Attached the latest patch.
+ ereport(elevel,
+ (errmsg("skipped %d frozen pages acoording to visibility map",
+ vacrelstats->vmskipped_frozen_pages)));
Typo: acoording should be according.
When vmskipped_frozen_pages is 1, "1 frozen pages" in log message
sounds incorrect in terms of grammar. So probably errmsg_plural()
should be used here.
Thank you for your advice.
Fixed.
+ relallvisible = visibilitymap_count(rel, VISIBILITYMAP_ALL_VISIBLE);
+ relallfrozen = visibilitymap_count(rel, VISIBILITYMAP_ALL_FROZEN);
We can refactor visibilitymap_count() so that it counts the numbers of
both all-visible and all-frozen pages at the same time, in order to
avoid reading through the visibility map twice.
I agree.
I've changed so.
heap_page_is_all_visible() can set all_frozen to TRUE even when
it returns FALSE. This is odd because the page must not be all frozen
when it's not all visible. heap_page_is_all_visible() should set
all_frozen to FALSE whenever all_visible is set to FALSE?
Probably it's better to forcibly set all_frozen to FALSE at the end of
the function whenever all_visible is FALSE.
Fixed.
+ if (PageIsAllVisible(page))
{
- Assert(BufferIsValid(*vmbuffer));
Why did you remove this assertion?
It's my mistake.
Fixed.
+ if (all_frozen)
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
Why didn't you call visibilitymap_test() for the all-frozen case here?
Same as above.
Fixed.
In visibilitymap_set(), the argument flag must be either
(VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) or
VISIBILITYMAP_ALL_VISIBLE. So I think that it's better to add
Assert() which checks whether the specified flag is valid or not.
I agree.
I added Assert() to beginning of visibilitymap_set() function.
+ * caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(PageIsAllVisible(heapPage) || PageIsAllFrozen(heapPage));
This should be the following?
Assert(((flag | VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
((flag | VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
I agree.
Fixed.
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v17.patchtext/x-patch; charset=US-ASCII; name=000_add_frozen_bit_into_visibilitymap_v17.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 97ef618..f8aa18b 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1841,6 +1841,18 @@
</row>
<row>
+ <entry><structfield>relallfrozen</structfield></entry>
+ <entry><type>int4</type></entry>
+ <entry></entry>
+ <entry>
+ Number of pages that are marked all-frozen in the table's
+ visibility map. It is updated by <command>VACUUM</command>,
+ <command>ANALYZE</command>, and a few DDL commands such as
+ <command>CREATE INDEX</command>.
+ </entry>
+ </row>
+
+ <row>
<entry><structfield>reltoastrelid</structfield></entry>
<entry><type>oid</type></entry>
<entry><literal><link linkend="catalog-pg-class"><structname>pg_class</structname></link>.oid</literal></entry>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5081da0..6bd4d57 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5900,7 +5900,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs aggressive freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -5944,7 +5944,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs aggressive freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index b5d4050..c8ad27f 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and which pages contain only tuples that are marked as
+ frozen. This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -438,23 +438,22 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows.
+ To ensure all old row versions have been frozen, a scan of all pages that
+ are not marked as frozen is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: a table sweep is forced if
+ the table hasn't had all of its row versions guaranteed frozen for
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transactions.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
+ the time <command>VACUUM</> last scanned pages that are not marked as frozen.
+ If it were to go unvacuumed for longer than
that, data loss could result. To ensure that this does not happen,
autovacuum is invoked on any table that might contain unfrozen rows with
XIDs older than the age specified by the configuration parameter <xref
@@ -490,8 +489,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +525,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +553,19 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. The whole table is scanned only when all pages happen to require
+ vacuuming to remove dead row versions. In other cases, such as when
+ <structfield>relfrozenxid</> is more than <varname>vacuum_freeze_table_age</>
+ transactions old or when <command>VACUUM</>'s <literal>FREEZE</> option is used,
+ <command>VACUUM</> can skip the pages on which all tuples are already
+ marked as frozen.
+ When all pages of the table are eventually marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transactions started since the <command>VACUUM</> started).
+ If <structfield>relfrozenxid</> has not been advanced by the time
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,28 +642,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ When <command>VACUUM</> scans all pages that are not marked as frozen,
+ regardless of what caused the scan, it can advance the value for that table.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. A
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +743,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..9328cdf 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -623,18 +623,21 @@ can be used to examine the information stored in free space maps.
<para>
Each heap relation has a Visibility Map
(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
+visible to all active transactions, and which pages contain only tuples that are
+marked as frozen.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_vm</> suffix.
+For example, if the filenode of a relation is 12345, the VM is stored in a file
+called <filename>12345_vm</>, in the same directory as the main relation file.
Note that indexes do not have VMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The visibility map stores two bits per heap page: all-visible and all-frozen.
+A set all-visible bit means that all tuples on the page are known to be visible
+to all transactions. A set all-frozen bit means that all tuples on the page are
+completely frozen. This means that the page does not contain any tuples
+that need to be vacuumed, even when a scan of the whole table is required.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bcf9871..84577b4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2176,8 +2176,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map
+ * page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -2776,9 +2777,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -6942,7 +6943,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -6952,6 +6953,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7541,7 +7543,12 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7593,7 +7600,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..c87cb65 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,33 +21,41 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if whole table scanning vacuum is required (e.g. anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible.
+ * That is, all-frozen bit is always set with all-visible bit.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. And all-frozen bit must be
+ * cleared with all-visible at the same time.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or frozen
+ * to all transactions; we just don't know that for certain. The difficulty is
+ * that there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility map
+ * bit. If a crash occurs after the visibility map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
+ * The visibility map has the all-frozen bit, which indicates that all tuples on
+ * the corresponding page have been completely frozen, so the visibility map is
+ * also used by anti-wraparound vacuums, even though freezing tuples is required.
*
* LOCKING
*
@@ -58,14 +66,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +109,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +126,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set bits for all-visible and all-frozen */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +170,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,11 +182,12 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) <<
+ (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s block %d", RelationGetRelationName(rel), heapBlk);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -225,7 +255,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +264,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and flags indicating which bit(s) we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +276,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,11 +286,13 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
Assert(InRecovery || BufferIsValid(heapBuf));
+ Assert((flags & VISIBILITYMAP_ALL_VISIBLE) ||
+ (flags & (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN)));
/* Check that we have the right heap page pinned, if present */
if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -272,11 +306,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +319,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +329,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(((flags | VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
+ ((flags | VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +349,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test whether bit(s) are set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must pass
+ * flags indicating which bit(s) we want to test.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -328,7 +368,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +377,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +400,12 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single or double bit read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ?
+ true : false;
return result;
}
@@ -374,14 +415,16 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * Both the all-visible and all-frozen page counts are returned via the output arguments.
*/
-BlockNumber
-visibilitymap_count(Relation rel)
+void
+visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)
{
- BlockNumber result = 0;
BlockNumber mapBlock;
+ *all_visible = *all_frozen = 0;
+
for (mapBlock = 0;; mapBlock++)
{
Buffer mapBuffer;
@@ -406,13 +449,12 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ *all_visible += number_of_ones_for_visible[map[i]];
+ *all_frozen += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
}
-
- return result;
}
/*
@@ -435,7 +477,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ elog(DEBUG1, "vm_truncate %s block %d", RelationGetRelationName(rel), nheapblocks);
#endif
RelationOpenSmgr(rel);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..fe743ba 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1918,12 +1918,15 @@ index_update_stats(Relation rel,
if (reltuples >= 0)
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
- BlockNumber relallvisible;
+ BlockNumber relallvisible, relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ visibilitymap_count(rel, &relallvisible, &relallfrozen);
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
if (rd_rel->relpages != (int32) relpages)
{
@@ -1940,6 +1943,11 @@ index_update_stats(Relation rel,
rd_rel->relallvisible = (int32) relallvisible;
dirty = true;
}
+ if (rd_rel->relallfrozen != (int32) relallfrozen)
+ {
+ rd_rel->relallfrozen = (int32) relallfrozen;
+ dirty = true;
+ }
}
/*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ddb68ab..53279a7 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -314,6 +314,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ BlockNumber relallvisible,
+ relallfrozen;
if (inh)
ereport(elevel,
@@ -564,6 +566,9 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
}
}
+ /* Calculate the number of all-visible and all-frozen bits */
+ visibilitymap_count(onerel, &relallvisible, &relallfrozen);
+
/*
* Update pages/tuples stats in pg_class ... but not if we're doing
* inherited stats.
@@ -572,7 +577,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ relallvisible,
+ relallfrozen,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -595,6 +601,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 6d55148..e5df123 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -747,6 +747,7 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
bool in_outer_xact)
@@ -784,6 +785,11 @@ vac_update_relstats(Relation relation,
pgcform->relallvisible = (int32) num_all_visible_pages;
dirty = true;
}
+ if (pgcform->relallfrozen != (int32) num_all_frozen_pages)
+ {
+ pgcform->relallfrozen = (int32) num_all_frozen_pages;
+ dirty = true;
+ }
/* Apply DDL updates, but not inside an outer transaction (see above) */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..2f92f05 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped by all-frozen bit
+ of visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we may still be able to skip scanning some
+ * pages according to the all-frozen bit of the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +258,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -301,10 +307,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ visibilitymap_count(onerel, &new_rel_allvisible, &new_rel_allfrozen);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -312,6 +321,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_pages,
new_rel_tuples,
new_rel_allvisible,
+ new_rel_allfrozen,
vacrelstats->hasindex,
new_frozen_xid,
new_min_multi,
@@ -360,10 +370,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +497,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * the visibility map means that we can't update relfrozenxid, so we only
+ * want to do it if we can skip a goodly number of pages. On the other
+ * hand, we count both how many pages we skipped according to the
+ * all-frozen bit of the visibility map and how many pages we froze, so we
+ * can update relfrozenxid if their sum covers every page of the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -498,7 +512,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we might not actually skip any pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
@@ -515,7 +529,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
break;
vacuum_delay_point();
}
@@ -533,9 +547,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples frozen by this vacuum */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on a single page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -548,7 +566,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE))
break;
vacuum_delay_point();
}
@@ -563,13 +582,32 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
skipping_all_visible_blocks = false;
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We check whether this block is also all-frozen so that we can skip
+ * vacuuming this page even when scanning the whole table is required.
+ */
+ bool all_frozen = visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN);
+ if (scan_all)
+ {
+ if (all_frozen)
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ }
+ else
+ {
+ if (skipping_all_visible_blocks)
+ continue;
+ }
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -740,7 +778,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +803,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +959,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +977,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
+ * We must log the changes to be crash-safe against future truncation
+ * of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1013,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1038,46 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,7 +1088,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
@@ -1078,7 +1148,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on the visibility map page.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1114,6 +1184,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to visibility map",
+ "skipped %d frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages,
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1226,6 +1303,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1277,19 +1355,34 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set the VM all-frozen bit and flag.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
+ uint8 flags = 0;
+
+ if (!visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* Set the VM all-frozen bit to flag, if needed */
+ if (all_frozen &&
+ !visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+
+ if (flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1408,6 +1501,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_pages,
stats->num_index_tuples,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
@@ -1779,10 +1873,12 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and all_frozen, which indicates whether
+ * all tuples on this page are frozen.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1791,6 +1887,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1814,11 +1911,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1855,6 +1953,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1863,6 +1965,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
@@ -1871,5 +1974,8 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
} /* scan along page */
+ if (!all_visible)
+ *all_frozen = false;
+
return all_visible;
}
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 41d4606..c34b5da 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -231,6 +231,15 @@ check_cluster_versions(void)
if (old_cluster.major_version > new_cluster.major_version)
pg_fatal("This utility cannot be used to downgrade to older major PostgreSQL versions.\n");
+ /*
+ * We can't allow upgrading in link mode from 9.5 or earlier to 9.6 or later,
+ * because the format of the visibility map changed in version 9.6.
+ */
+ if (user_opts.transfer_mode == TRANSFER_MODE_LINK &&
+ GET_MAJOR_VERSION(old_cluster.major_version) < 906 &&
+ GET_MAJOR_VERSION(new_cluster.major_version) >= 906)
+ pg_fatal("This utility cannot upgrade from PostgreSQL version from 9.5 or before to 9.6 or later with link mode.\n");
+
/* get old and new binary versions */
get_bin_version(&old_cluster);
get_bin_version(&new_cluster);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 79d9390..9bae08c 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
#include <fcntl.h>
@@ -21,6 +22,43 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static const char *rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool force);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
+
+/*
+ * copyOrRewriteFile()
+ * This function copies a file or rewrites a visibility map file.
+ * If rewrite_vm is true, we rewrite the visibility map regardless of the value of pageConverter.
+ */
+const char *
+copyOrRewriteFile(pageCnvCtx *pageConverter,
+ const char *src, const char *dst, bool force, bool rewrite_vm)
+{
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, force);
+ else
+ return copyAndUpdateFile(pageConverter, src, dst, force);
+}
/*
* copyAndUpdateFile()
@@ -201,6 +239,99 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilitymap()
+ *
+ * An additional bit, which indicates that all tuples on a page are completely
+ * frozen, was added to the visibility map in PG 9.6, so the format of the
+ * visibility map has changed.
+ * Copies a visibility map file while inserting an all-frozen bit (initially 0)
+ * after each existing bit.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+#define REWRITE_BUF_SIZE (50 * BLCKSZ)
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd, dst_fd;
+ uint16 vm_bits;
+ ssize_t nbytes;
+ char *buffer;
+ int ret = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ {
+ errno = EINVAL;
+ goto err;
+ }
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
+
+ /* Copy page header data in advance */
+ if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
+ goto err;
+
+ if (write(dst_fd, buffer, nbytes) != nbytes)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ /* perform data rewriting, i.e., read from source, write to destination */
+ while (true)
+ {
+ ssize_t nbytes = read(src_fd, buffer, REWRITE_BUF_SIZE);
+ char *cur, *end;
+
+ if (nbytes < 0)
+ {
+ ret = -1;
+ break;
+ }
+
+ if (nbytes == 0)
+ break;
+
+ cur = buffer;
+ end = buffer + nbytes;
+
+ /* Rewrite each source byte into BITS_PER_HEAPBLOCK bytes and write them to dst_fd */
+ while (end > cur)
+ {
+ /* Get the rewritten bits for this byte from the table */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+
+ if (write(dst_fd, &vm_bits, BITS_PER_HEAPBLOCK) != BITS_PER_HEAPBLOCK)
+ {
+ ret = -1;
+ break;
+ }
+ cur++;
+ }
+ }
+
+err:
+
+ if (!buffer)
+ pg_free(buffer);
+
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText(errno);
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 13aa891..0f81ba5 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -112,6 +112,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map changed with this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201510081
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -396,6 +400,8 @@ const pageCnvCtx *setupPageConverter(void);
typedef void *pageCnvCtx;
#endif
+const char *copyOrRewriteFile(pageCnvCtx *pageConverter, const char *src,
+ const char *dst, bool force, bool rewrite_vm);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c22df42..2fa5b47 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_rewrite_needed = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,6 +181,13 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_rewrite_needed = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
@@ -195,7 +203,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
* Copy/link any fsm and vm files, if they exist
*/
transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
- if (vm_crashsafe_match)
+ if (vm_crashsafe_match || vm_rewrite_needed)
transfer_relfile(pageConverter, &maps[mapnum], "_vm");
}
}
@@ -218,6 +226,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
int fd;
int segno;
char extent_suffix[65];
+ bool rewrite_vm = false;
/*
* Now copy/link any related segments as well. Remember, PG breaks large
@@ -276,7 +285,15 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (strcmp(type_suffix, "_vm") == 0 &&
+ old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ rewrite_vm = true;
+
+ if ((msg = copyOrRewriteFile(pageConverter, old_file, new_file, true, rewrite_vm)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index aa7f399..6b058d4 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..93afb10 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..4dc1314 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,21 @@
#include "storage/buf.h"
#include "utils/relcache.h"
-extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
+extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern void visibilitymap_count(Relation rel, BlockNumber *all_visible,
+ BlockNumber *all_frozen);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 4c08d2e..83955ab 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201510052
+#define CATALOG_VERSION_NO 201510081
#endif
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 06d287e..e8c1316 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -96,7 +98,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 31
+#define Natts_pg_class 32
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -108,26 +110,27 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relforcerowsecurity 25
-#define Anum_pg_class_relispopulated 26
-#define Anum_pg_class_relreplident 27
-#define Anum_pg_class_relfrozenxid 28
-#define Anum_pg_class_relminmxid 29
-#define Anum_pg_class_relacl 30
-#define Anum_pg_class_reloptions 31
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relforcerowsecurity 26
+#define Anum_pg_class_relispopulated 27
+#define Anum_pg_class_relreplident 28
+#define Anum_pg_class_relfrozenxid 29
+#define Anum_pg_class_relminmxid 30
+#define Anum_pg_class_relacl 31
+#define Anum_pg_class_reloptions 32
/* ----------------
* initial contents of pg_class
@@ -142,13 +145,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 29 0 t f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 29 0 t f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 32 0 t f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e3a31af..d2bae2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -172,6 +172,7 @@ extern void vac_update_relstats(Relation relation,
BlockNumber num_pages,
double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..1040885 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -369,6 +371,11 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..0dd5cc1
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,29 @@
+--
+-- Visibility map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+\set VERBOSITY terse
+-- All pages become all-visible
+VACUUM vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- All pages become all-frozen
+VACUUM FREEZE vmtest;
+SELECT relallfrozen = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- All pages are skipped according to VM
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 45 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 6fc5d1e..a5ff786 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -108,5 +108,8 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare without_oid c
# event triggers cannot run concurrently with any test that runs DDL
test: event_trigger
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
+
# run stats by itself because its delay may be insufficient under heavy load
test: stats
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 2ae51cf..d386d67 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -158,3 +158,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
\ No newline at end of file
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..53d817e
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,20 @@
+--
+-- Visibility map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+\set VERBOSITY terse
+
+-- All pages become all-visible
+VACUUM vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- All pages become all-frozen
+VACUUM FREEZE vmtest;
+SELECT relallfrozen = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- All pages are skipped according to VM
+VACUUM FREEZE VERBOSE vmtest;
+
+DROP TABLE vmtest;
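
As an aside for anyone reading the pg_upgrade hunk above: the entries of rewrite_vm_table can be derived by spreading each all-visible bit of an old-format vm byte to every other bit position of a 16-bit value, leaving the interleaved all-frozen bits zero. Here is a minimal standalone sketch of that derivation; the rewrite_vm_byte helper is only illustrative and not part of the patch.

#include <stdint.h>
#include <stdio.h>

/*
 * Spread the 8 all-visible bits of an old-format visibility map byte
 * into a 16-bit new-format value: the bit for heap block i moves from
 * position i to position 2*i, and the interleaved all-frozen bits
 * (positions 2*i + 1) remain zero.
 */
static uint16_t
rewrite_vm_byte(uint8_t old_byte)
{
    uint16_t new_bits = 0;
    int i;

    for (i = 0; i < 8; i++)
        if (old_byte & (1 << i))
            new_bits |= (uint16_t) 1 << (2 * i);

    return new_bits;
}

int
main(void)
{
    /* Should print 1 4 85 21845, matching rewrite_vm_table[1], [2], [15], [255] */
    printf("%u %u %u %u\n",
           (unsigned) rewrite_vm_byte(0x01), (unsigned) rewrite_vm_byte(0x02),
           (unsigned) rewrite_vm_byte(0x0F), (unsigned) rewrite_vm_byte(0xFF));
    return 0;
}
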
On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote:
On 10/01/2015 07:43 AM, Robert Haas wrote:
On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
I wonder how much it's worth renaming only the file extension while
there are many places where "visibility map" and "vm" are used,
for example, log messages, function names, variables, etc.
I'd be inclined to keep calling it the visibility map (vm) even if it
also contains freeze information.
-1 to rename. Visibility Map is a perfectly good name.
The name can stay the same, but specifically the file extension should
change.
This patch changes the layout of existing information:
* _vm stores one bit per page
* _$new stores two bits per page
The problem is we won't be able to tell the two formats apart, since they
both are just lots of bits. So we won't be able to tell if the file is old
format or new format, which could lead to loss of information that relates
to visibility. If we think something is all-visible when it is not, this is
effectively data corruption.
In light of lessons learned from multixactids, I think its important that
we are able to tell the difference between an old format and a new format
visibility map.
My suggestion to do so was to call it "vfm", so we indicate that it is now
a Visibility & Freeze Map
I don't care if we change the name, but I do care if we can't tell the
difference between a failed upgrade, a normal upgrade and a server that has
been upgraded multiple times. Alternate suggestions welcome.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
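
To make the ambiguity concrete, here is a small illustrative sketch, assuming the layout used by the patch in this thread (BITS_PER_HEAPBLOCK = 2 with the all-visible bit in the low position of each pair); decode_byte is a made-up helper, not PostgreSQL code. The same on-disk byte decodes to different sets of all-visible heap blocks depending on which format you assume, so the raw bits alone cannot identify the format.

#include <stdint.h>
#include <stdio.h>

#define VISIBILITYMAP_ALL_VISIBLE  0x01
#define VISIBILITYMAP_ALL_FROZEN   0x02
#define BITS_PER_HEAPBLOCK         2

static void
decode_byte(uint8_t map_byte)
{
    int blk;

    /* Old format: one all-visible bit per heap block, eight blocks per byte */
    printf("old format, all-visible blocks:");
    for (blk = 0; blk < 8; blk++)
        if (map_byte & (1 << blk))
            printf(" %d", blk);
    printf("\n");

    /* New format: two bits per heap block, four blocks per byte */
    printf("new format, all-visible blocks:");
    for (blk = 0; blk < 8 / BITS_PER_HEAPBLOCK; blk++)
        if (map_byte & (VISIBILITYMAP_ALL_VISIBLE << (BITS_PER_HEAPBLOCK * blk)))
            printf(" %d%s", blk,
                   (map_byte & (VISIBILITYMAP_ALL_FROZEN << (BITS_PER_HEAPBLOCK * blk)))
                   ? " (frozen)" : "");
    printf("\n");
}

int
main(void)
{
    /* 0x0F: blocks 0-3 all-visible in the old format,
     * but blocks 0-1 all-visible and all-frozen in the new format. */
    decode_byte(0x0F);
    return 0;
}
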
On October 8, 2015 7:35:24 PM GMT+02:00, Simon Riggs <simon@2ndQuadrant.com> wrote:
The problem is we won't be able to tell the two formats apart, since they
both are just lots of bits. So we won't be able to tell if the file is old
format or new format, which could lead to loss of information that relates
to visibility.
I don't see the problem? I mean catversion will reliably tell you which format the vm is in?
We could additionally use the opportunity to add a metapage, but that seems like an independent thing.
Andres
---
Please excuse brevity and formatting - I am writing this on my mobile phone.
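For what it's worth, the check the patch adds to pg_upgrade already works along these lines; a simplified sketch (ClusterData here is only a stand-in for pg_upgrade's real per-cluster control data):

#include <stdbool.h>
#include <stdint.h>

/* Catalog version that introduced the all-frozen bit (value from the patch) */
#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201510081

/* Illustrative stand-in for pg_upgrade's per-cluster control data */
typedef struct
{
    uint32_t cat_ver;
} ClusterData;

/*
 * The vm format is identified by the catalog version recorded in pg_control,
 * not by inspecting the vm file itself: a rewrite is needed only when the
 * upgrade crosses the version that added the all-frozen bit.
 */
static bool
vm_rewrite_needed(const ClusterData *old_cluster, const ClusterData *new_cluster)
{
    return old_cluster->cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
           new_cluster->cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER;
}
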
On Thu, Oct 8, 2015 at 1:52 PM, Andres Freund <andres@anarazel.de> wrote:
I don't see the problem? I mean catversion will reliably tell you which format the vm is in?
Totally agreed.
We could additionally use the opportunity to add a metapage, but that seems like an independent thing.
I agree with that, too.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Oct 10, 2015 at 4:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Oct 8, 2015 at 1:52 PM, Andres Freund <andres@anarazel.de> wrote:
I don't see the problem? I mean catversion will reliably tell you which format the vm is in?
Totally agreed.
We could additionally use the opportunity to add a metapage, but that seems like an independent thing.
I agree with that, too.
Attached is the updated v18 patch, which fixes some bugs.
Please review the patch.
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v18.patch (text/x-patch; charset=US-ASCII)
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 97ef618..f8aa18b 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1841,6 +1841,18 @@
</row>
<row>
+ <entry><structfield>relallfrozen</structfield></entry>
+ <entry><type>int4</type></entry>
+ <entry></entry>
+ <entry>
+ Number of pages that are marked all-frozen in the table's
+ visibility map. It is updated by <command>VACUUM</command>,
+ <command>ANALYZE</command>, and a few DDL commands such as
+ <command>CREATE INDEX</command>.
+ </entry>
+ </row>
+
+ <row>
<entry><structfield>reltoastrelid</structfield></entry>
<entry><type>oid</type></entry>
<entry><literal><link linkend="catalog-pg-class"><structname>pg_class</structname></link>.oid</literal></entry>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5549de7..bb63bb9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5900,7 +5900,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs aggressive freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -5944,7 +5944,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs aggressive freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index b5d4050..c8ad27f 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and which pages contain only tuples that are marked as
+ frozen. This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -438,23 +438,22 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows.
+ To ensure all old row versions have been frozen, a scan of all pages that
+ are not marked as frozen is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: a table sweep is forced if
+ the table's row versions have not all been confirmed frozen within the last
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transactions.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
+ the time <command>VACUUM</> last scanned the pages that are not marked as frozen.
+ If it were to go unvacuumed for longer than
that, data loss could result. To ensure that this does not happen,
autovacuum is invoked on any table that might contain unfrozen rows with
XIDs older than the age specified by the configuration parameter <xref
@@ -490,8 +489,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +525,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +553,19 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. The whole table is scanned only when all pages happen to require
+ vacuuming to remove dead row versions. In other cases, such as when
+ <structfield>relfrozenxid</> is more than <varname>vacuum_freeze_table_age</>
+ transactions old or when <command>VACUUM</>'s <literal>FREEZE</> option is used,
+ <command>VACUUM</> can skip pages on which all tuples are already
+ marked as frozen.
+ When all pages of the table are eventually marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transactions started since the <command>VACUUM</> started).
+ If <structfield>relfrozenxid</> has not been advanced by the time
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,28 +642,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ When <command>VACUUM</> scans all pages that are not marked as frozen,
+ regardless of what causes it, the value for that table can be advanced.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Such
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +743,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
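As a rough standalone illustration of the behaviour the maintenance.sgml passages above describe (not PostgreSQL code; the function name and the sample ages are made up), the age-based trigger for an aggressive, freezing scan boils down to:

#include <stdbool.h>
#include <stdio.h>

static bool
needs_aggressive_vacuum(unsigned relfrozenxid_age,
                        unsigned vacuum_freeze_table_age,
                        unsigned autovacuum_freeze_max_age)
{
    /* a scheduled VACUUM turns into a freezing sweep past vacuum_freeze_table_age;
     * autovacuum forces one once autovacuum_freeze_max_age is reached */
    return relfrozenxid_age > vacuum_freeze_table_age ||
           relfrozenxid_age > autovacuum_freeze_max_age;
}

int
main(void)
{
    printf("%d\n", needs_aggressive_vacuum(160000000, 150000000, 200000000)); /* prints 1 */
    return 0;
}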
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..9328cdf 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -623,18 +623,21 @@ can be used to examine the information stored in free space maps.
<para>
Each heap relation has a Visibility Map
(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
+visible to all active transactions, and which pages contain only tuples that are
+marked as frozen.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_vm</> suffix.
+For example, if the filenode of a relation is 12345, the VM is stored in a file
+called <filename>12345_vm</>, in the same directory as the main relation file.
Note that indexes do not have VMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The visibility map stores two bits per heap page: all-visible and all-frozen.
+A set all-visible bit means that all tuples on the page are known to be visible
+to all transactions. A set all-frozen bit means that all tuples on the page are
+completely frozen. This means that the page does not contain any tuples
+that need to be vacuumed, even when a whole-table scan is required.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
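For reference, here is a tiny standalone sketch (illustration only, not part of the patch) of how one heap block maps onto its pair of bits in the new layout. The BITS_PER_HEAPBLOCK and HEAPBLOCKS_PER_BYTE values mirror the visibilitymap.c changes further down; MAPSIZE is a simplified stand-in for the real BLCKSZ-derived value:

#include <stdio.h>

#define MAPSIZE             8192   /* stand-in; really BLCKSZ minus the page header */
#define BITS_PER_HEAPBLOCK  2
#define HEAPBLOCKS_PER_BYTE 4
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)

#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x)  ((x) % HEAPBLOCKS_PER_BYTE)

int
main(void)
{
    unsigned blk = 10;          /* arbitrary heap block number */

    printf("block %u -> map byte %u, bits %u and %u\n",
           blk,
           HEAPBLK_TO_MAPBYTE(blk),
           BITS_PER_HEAPBLOCK * HEAPBLK_TO_MAPBIT(blk),        /* all-visible */
           BITS_PER_HEAPBLOCK * HEAPBLK_TO_MAPBIT(blk) + 1);   /* all-frozen */
    return 0;
}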
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 66deb1f..e70f110 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2400,8 +2400,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map
+ * page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -3000,9 +3001,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -7166,7 +7167,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -7176,6 +7177,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7765,7 +7767,12 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7817,7 +7824,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
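A simplified, self-contained sketch of what the new flags field in the all-visible WAL record carries and how redo applies it. The struct layout and the 0x01/0x02 flag values are assumptions for illustration; the real definitions live in heapam_xlog.h and visibilitymap.h, which this excerpt does not show:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

#define VISIBILITYMAP_ALL_VISIBLE 0x01   /* assumed values */
#define VISIBILITYMAP_ALL_FROZEN  0x02

typedef struct
{
    TransactionId cutoff_xid;
    uint8_t       flags;   /* new: which VM bits the record sets */
} xl_heap_visible_sketch;

int
main(void)
{
    xl_heap_visible_sketch xlrec = {0, VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN};
    bool page_all_visible = false;
    bool page_all_frozen = false;

    /* mirrors the redo branch in heap_xlog_visible() above */
    if (xlrec.flags & VISIBILITYMAP_ALL_VISIBLE)
        page_all_visible = true;    /* PageSetAllVisible() in the real code */
    if (xlrec.flags & VISIBILITYMAP_ALL_FROZEN)
        page_all_frozen = true;     /* PageSetAllFrozen() in the real code */

    printf("all-visible=%d all-frozen=%d\n", page_all_visible, page_all_frozen);
    return 0;
}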
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..dc4d582 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,33 +21,41 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if a whole-table-scanning vacuum is required (e.g. an anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible.
+ * That is, the all-frozen bit is only ever set together with the all-visible bit.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. The all-frozen bit must be
+ * cleared together with the all-visible bit.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is all-visible or
+ * all-frozen; we just don't know that for certain. The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility map
+ * bit. If a crash occurs after the visibility map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
+ * The visibility map has the all-frozen bit, which indicates that all tuples on
+ * the corresponding page have been completely frozen, so the visibility map is
+ * also used for anti-wraparound vacuums, even though freezing tuples is required.
*
* LOCKING
*
@@ -58,14 +66,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +109,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +126,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set all-visible and all-frozen bits */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +170,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits for one page in the visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,11 +182,12 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) <<
+ (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s block %d", RelationGetRelationName(rel), heapBlk);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -225,7 +255,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +264,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and the flags indicating which bit(s) we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +276,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,11 +286,13 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
Assert(InRecovery || BufferIsValid(heapBuf));
+ Assert((flags & VISIBILITYMAP_ALL_VISIBLE) ||
+ (flags & (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN)));
/* Check that we have the right heap page pinned, if present */
if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -272,11 +306,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if ((map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) != (flags << (BITS_PER_HEAPBLOCK * mapBit)))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +319,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +329,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(((flags & VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
+ ((flags & VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +349,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test whether bit(s) are set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible to all transactions, or all frozen, according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits. The caller must
+ * pass flags indicating which bit(s) to test.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -328,7 +368,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +377,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +400,12 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single or double bit read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ?
+ true : false;
return result;
}
@@ -374,14 +415,16 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * The all-visible and all-frozen page counts are returned via the output arguments.
*/
-BlockNumber
-visibilitymap_count(Relation rel)
+void
+visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)
{
- BlockNumber result = 0;
BlockNumber mapBlock;
+ *all_visible = *all_frozen = 0;
+
for (mapBlock = 0;; mapBlock++)
{
Buffer mapBuffer;
@@ -406,13 +449,12 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ *all_visible += number_of_ones_for_visible[map[i]];
+ *all_frozen += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
}
-
- return result;
}
/*
@@ -435,7 +477,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ elog(DEBUG1, "vm_truncate %s block %d", RelationGetRelationName(rel), nheapblocks);
#endif
RelationOpenSmgr(rel);
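The two lookup tables above follow mechanically from the new layout: each map byte covers four heap blocks, with the all-visible bit in the even positions and the all-frozen bit in the odd positions. A throwaway generator sketch, for illustration only and not part of the patch:

#include <stdio.h>

int
main(void)
{
    for (int byte = 0; byte < 256; byte++)
    {
        int visible = 0;
        int frozen = 0;

        for (int blk = 0; blk < 4; blk++)
        {
            if (byte & (1 << (2 * blk)))        /* even bit: all-visible */
                visible++;
            if (byte & (1 << (2 * blk + 1)))    /* odd bit: all-frozen */
                frozen++;
        }
        /* prints the entries of number_of_ones_for_visible / _frozen side by side */
        printf("%d/%d%s", visible, frozen, (byte % 16 == 15) ? "\n" : " ");
    }
    return 0;
}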
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 7d7d062..c78bebe 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -790,6 +790,7 @@ InsertPgClassTuple(Relation pg_class_desc,
values[Anum_pg_class_relpages - 1] = Int32GetDatum(rd_rel->relpages);
values[Anum_pg_class_reltuples - 1] = Float4GetDatum(rd_rel->reltuples);
values[Anum_pg_class_relallvisible - 1] = Int32GetDatum(rd_rel->relallvisible);
+ values[Anum_pg_class_relallfrozen - 1] = Int32GetDatum(rd_rel->relallfrozen);
values[Anum_pg_class_reltoastrelid - 1] = ObjectIdGetDatum(rd_rel->reltoastrelid);
values[Anum_pg_class_relhasindex - 1] = BoolGetDatum(rd_rel->relhasindex);
values[Anum_pg_class_relisshared - 1] = BoolGetDatum(rd_rel->relisshared);
@@ -869,18 +870,21 @@ AddNewRelationTuple(Relation pg_class_desc,
new_rel_reltup->relpages = 0;
new_rel_reltup->reltuples = 0;
new_rel_reltup->relallvisible = 0;
+ new_rel_reltup->relallfrozen = 0;
break;
case RELKIND_SEQUENCE:
/* Sequences always have a known size */
new_rel_reltup->relpages = 1;
new_rel_reltup->reltuples = 1;
new_rel_reltup->relallvisible = 0;
+ new_rel_reltup->relallfrozen = 0;
break;
default:
/* Views, etc, have no disk storage */
new_rel_reltup->relpages = 0;
new_rel_reltup->reltuples = 0;
new_rel_reltup->relallvisible = 0;
+ new_rel_reltup->relallfrozen = 0;
break;
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..7753f66 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1813,8 +1813,8 @@ FormIndexDatum(IndexInfo *indexInfo,
* isprimary: if true, set relhaspkey true; else no change
* reltuples: if >= 0, set reltuples to this value; else no change
*
- * If reltuples >= 0, relpages and relallvisible are also updated (using
- * RelationGetNumberOfBlocks() and visibilitymap_count()).
+ * If reltuples >= 0, relpages, relallvisible and relallfrozen are also updated
+ * (using RelationGetNumberOfBlocks() and visibilitymap_count()).
*
* NOTE: an important side-effect of this operation is that an SI invalidation
* message is sent out to all backends --- including me --- causing relcache
@@ -1859,8 +1859,8 @@ index_update_stats(Relation rel,
* true is safe even if there are no indexes (VACUUM will eventually fix
* it), likewise for relhaspkey. And of course the new relpages and
* reltuples counts are correct regardless. However, we don't want to
- * change relpages (or relallvisible) if the caller isn't providing an
- * updated reltuples count, because that would bollix the
+ * change relpages (or relallvisible/relallfrozen) if the caller isn't
+ * providing an updated reltuples count, because that would bollix the
* reltuples/relpages ratio which is what's really important.
*/
@@ -1918,12 +1918,15 @@ index_update_stats(Relation rel,
if (reltuples >= 0)
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
- BlockNumber relallvisible;
+ BlockNumber relallvisible, relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ visibilitymap_count(rel, &relallvisible, &relallfrozen);
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
if (rd_rel->relpages != (int32) relpages)
{
@@ -1940,6 +1943,11 @@ index_update_stats(Relation rel,
rd_rel->relallvisible = (int32) relallvisible;
dirty = true;
}
+ if (rd_rel->relallfrozen != (int32) relallfrozen)
+ {
+ rd_rel->relallfrozen = (int32) relallfrozen;
+ dirty = true;
+ }
}
/*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ddb68ab..681b4a9 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -314,6 +314,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ BlockNumber relallvisible,
+ relallfrozen;
if (inh)
ereport(elevel,
@@ -564,6 +566,10 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
}
}
+ /* Count the number of all-visible and all-frozen pages */
+ if (!inh)
+ visibilitymap_count(onerel, &relallvisible, &relallfrozen);
+
/*
* Update pages/tuples stats in pg_class ... but not if we're doing
* inherited stats.
@@ -572,7 +578,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ relallvisible,
+ relallfrozen,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -595,6 +602,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ab4874..beb0ecf 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1266,6 +1266,7 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
int32 swap_pages;
float4 swap_tuples;
int32 swap_allvisible;
+ int32 swap_allfrozen;
swap_pages = relform1->relpages;
relform1->relpages = relform2->relpages;
@@ -1278,6 +1279,10 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
swap_allvisible = relform1->relallvisible;
relform1->relallvisible = relform2->relallvisible;
relform2->relallvisible = swap_allvisible;
+
+ swap_allfrozen = relform1->relallfrozen;
+ relform1->relallfrozen = relform2->relallfrozen;
+ relform2->relallfrozen = swap_allfrozen;
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 6d55148..19b768d 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -729,11 +729,11 @@ vac_estimate_reltuples(Relation relation, bool is_analyze,
* marked with xmin = our xid.
*
* In addition to fundamentally nontransactional statistics such as
- * relpages and relallvisible, we try to maintain certain lazily-updated
- * DDL flags such as relhasindex, by clearing them if no longer correct.
- * It's safe to do this in VACUUM, which can't run in parallel with
- * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
- * However, it's *not* safe to do it in an ANALYZE that's within an
+ * relpages, relallvisible and relallfrozen, we try to maintain certain
+ * lazily-updated DDL flags such as relhasindex, by clearing them if no
+ * longer correct. It's safe to do this in VACUUM, which can't run in
+ * parallel with CREATE INDEX/RULE/TRIGGER and can't be part of a transaction
+ * block. However, it's *not* safe to do it in an ANALYZE that's within an
* outer transaction, because for example the current transaction might
* have dropped the last index; then we'd think relhasindex should be
* cleared, but if the transaction later rolls back this would be wrong.
@@ -747,6 +747,7 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
bool in_outer_xact)
@@ -784,6 +785,11 @@ vac_update_relstats(Relation relation,
pgcform->relallvisible = (int32) num_all_visible_pages;
dirty = true;
}
+ if (pgcform->relallfrozen != (int32) num_all_frozen_pages)
+ {
+ pgcform->relallfrozen = (int32) num_all_frozen_pages;
+ dirty = true;
+ }
/* Apply DDL updates, but not inside an outer transaction (see above) */
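A minimal sketch of the new relallfrozen bookkeeping in vac_update_relstats(): the tuple is only marked dirty when a counter actually changes. The struct and function names here are stand-ins for illustration, not the patch's code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* toy stand-in for the pg_class fields touched here */
typedef struct
{
    int32_t relallvisible;
    int32_t relallfrozen;    /* new column */
} pg_class_sketch;

static bool
update_frozen_stats(pg_class_sketch *pgcform,
                    uint32_t num_all_visible_pages,
                    uint32_t num_all_frozen_pages)
{
    bool dirty = false;

    /* same pattern as vac_update_relstats(): only dirty the tuple on change */
    if (pgcform->relallvisible != (int32_t) num_all_visible_pages)
    {
        pgcform->relallvisible = (int32_t) num_all_visible_pages;
        dirty = true;
    }
    if (pgcform->relallfrozen != (int32_t) num_all_frozen_pages)
    {
        pgcform->relallfrozen = (int32_t) num_all_frozen_pages;
        dirty = true;
    }
    return dirty;
}

int
main(void)
{
    pg_class_sketch c = {100, 0};

    printf("dirty=%d allfrozen=%d\n", update_frozen_stats(&c, 100, 80), c.relallfrozen);
    return 0;
}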
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..e31597f 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages skipped according to the
+ all-frozen bit of the visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we can still skip pages whose all-frozen bit
+ * is set in the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +258,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -286,9 +292,9 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* density") with nonzero relpages and reltuples=0 (which means "zero
* tuple density") unless there's some actual evidence for the latter.
*
- * We do update relallvisible even in the corner case, since if the table
- * is all-visible we'd definitely like to know that. But clamp the value
- * to be not more than what we're setting relpages to.
+ * We do update relallvisible and relallfrozen even in the corner case,
+ * since if the table is all-visible we'd definitely like to know that.
+ * But clamp the value to be not more than what we're setting relpages to.
*
* Also, don't change relfrozenxid/relminmxid if we skipped any pages,
* since then we don't know for certain that all tuples have a newer xmin.
@@ -301,10 +307,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ visibilitymap_count(onerel, &new_rel_allvisible, &new_rel_allfrozen);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -312,6 +321,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_pages,
new_rel_tuples,
new_rel_allvisible,
+ new_rel_allfrozen,
vacrelstats->hasindex,
new_frozen_xid,
new_min_multi,
@@ -360,10 +370,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +497,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of the
+ * visibility map means that we can't update relfrozenxid, so we only want
+ * to do it if we can skip a goodly number of pages. On the other hand, we
+ * count how many pages we skipped according to the all-frozen bit of the
+ * visibility map, so we can still update relfrozenxid if the skipped pages
+ * and the scanned pages together account for every page of the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -498,24 +512,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we might not actually skip any pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
* Note: The value returned by visibilitymap_test could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*/
for (next_not_all_visible_block = 0;
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE))
break;
vacuum_delay_point();
}
@@ -533,9 +548,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples we froze on this page */
+ int nalready_frozen; /* # of tuples that were already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -548,7 +567,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE))
break;
vacuum_delay_point();
}
@@ -563,13 +583,30 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
skipping_all_visible_blocks = false;
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We also check whether the block is all-frozen, in which case we can
+ * skip vacuuming it even if scan_all is true.
+ */
+ bool all_frozen = visibilitymap_test(onerel, blkno, &vmbuffer,
+ VISIBILITYMAP_ALL_FROZEN);
+ if (scan_all)
+ {
+ if (all_frozen)
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ }
+ else if (skipping_all_visible_blocks)
+ continue;
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -740,7 +777,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +802,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +958,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +976,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
+ * We must log the changes to be crash-safe against future truncation
+ * of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1012,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the total number of frozen tuples on the page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1037,46 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,7 +1087,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
@@ -1078,7 +1147,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on the visibility map page (it also holds the all-frozen bits).
*/
if (BufferIsValid(vmbuffer))
{
@@ -1114,6 +1183,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to visibility map",
+ "skipped %d frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages,
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1226,6 +1302,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1277,19 +1354,34 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set the VM all-frozen bit.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
+ uint8 flags = 0;
+
+ if (!visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* Also set the all-frozen bit in flags, if needed */
+ if (all_frozen &&
+ !visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+
+ if (flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1408,6 +1500,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_pages,
stats->num_index_tuples,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
@@ -1779,10 +1872,12 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and set all_frozen to indicate whether all
+ * tuples on this page are frozen.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1791,6 +1886,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1814,11 +1910,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible or frozen.
*/
if (ItemIdIsDead(itemid))
{
all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1855,6 +1952,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1863,6 +1964,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
@@ -1871,5 +1973,8 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
} /* scan along page */
+ if (!all_visible)
+ *all_frozen = false;
+
return all_visible;
}
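A minimal sketch of the bookkeeping change in lazy_vacuum_rel(): pages skipped via the all-frozen bit now count toward having covered the whole table, which is what allows relfrozenxid to be advanced without physically reading them. Names follow the patch; the logic is simplified and the surrounding VACUUM machinery is omitted:

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int BlockNumber;

static bool
scanned_all(BlockNumber scanned_pages,
            BlockNumber vmskipped_frozen_pages,
            BlockNumber rel_pages)
{
    /* inverse of the patch's test:
     * (scanned_pages + vmskipped_frozen_pages) < rel_pages  =>  not scanned_all */
    return scanned_pages + vmskipped_frozen_pages >= rel_pages;
}

int
main(void)
{
    /* 60 of 100 pages were skipped because their all-frozen bit was set */
    printf("%d\n", scanned_all(40, 60, 100));   /* prints 1 */
    return 0;
}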
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/rewrite/rewriteDefine.c b/src/backend/rewrite/rewriteDefine.c
index 39c83a6..560cf5a 100644
--- a/src/backend/rewrite/rewriteDefine.c
+++ b/src/backend/rewrite/rewriteDefine.c
@@ -604,6 +604,7 @@ DefineQueryRewrite(char *rulename,
classForm->relpages = 0;
classForm->reltuples = 0;
classForm->relallvisible = 0;
+ classForm->relallfrozen = 0;
classForm->reltoastrelid = InvalidOid;
classForm->relhasindex = false;
classForm->relkind = RELKIND_VIEW;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 9c3d096..8aa8470 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1613,6 +1613,7 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_rel->relpages = 0;
relation->rd_rel->reltuples = 0;
relation->rd_rel->relallvisible = 0;
+ relation->rd_rel->relallfrozen = 0;
relation->rd_rel->relkind = RELKIND_RELATION;
relation->rd_rel->relhasoids = hasoids;
relation->rd_rel->relnatts = (int16) natts;
@@ -3114,6 +3115,7 @@ RelationSetNewRelfilenode(Relation relation, char persistence,
classform->relpages = 0; /* it's empty until further notice */
classform->reltuples = 0;
classform->relallvisible = 0;
+ classform->relallfrozen = 0;
}
classform->relfrozenxid = freezeXid;
classform->relminmxid = minmulti;
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 41d4606..a3ce324 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -231,6 +231,15 @@ check_cluster_versions(void)
if (old_cluster.major_version > new_cluster.major_version)
pg_fatal("This utility cannot be used to downgrade to older major PostgreSQL versions.\n");
+ /*
+ * We can't allow upgrading in link mode from 9.5 or before to 9.6 or later,
+ * because the format of the visibility map changed in version 9.6.
+ */
+ if (user_opts.transfer_mode == TRANSFER_MODE_LINK &&
+ GET_MAJOR_VERSION(old_cluster.major_version) < 906 &&
+ GET_MAJOR_VERSION(new_cluster.major_version) >= 906)
+ pg_fatal("This utility cannot upgrade from PostgreSQL version from 9.5 or before to 9.6 or later with link mode.\n");
+
/* get old and new binary versions */
get_bin_version(&old_cluster);
get_bin_version(&new_cluster);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 79d9390..d47a98b 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
#include <fcntl.h>
@@ -21,6 +22,43 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static const char *rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool force);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
+
+/*
+ * copyOrRewriteFile()
+ * This function copies a file, or rewrites a visibility map file.
+ * If rewrite_vm is true, we rewrite the visibility map regardless of the value of pageConverter.
+ */
+const char *
+copyOrRewriteFile(pageCnvCtx *pageConverter,
+ const char *src, const char *dst, bool force, bool rewrite_vm)
+{
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, force);
+ else
+ return copyAndUpdateFile(pageConverter, src, dst, force);
+}
/*
* copyAndUpdateFile()
@@ -201,6 +239,97 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilitymap()
+ *
+ * An additional bit, indicating that all tuples on the page are completely
+ * frozen, was added to the visibility map in PG 9.6, so the format of the
+ * visibility map has changed.
+ * Copies a visibility map file while adding a cleared all-frozen bit (0) for
+ * each existing bit.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+#define REWRITE_BUF_SIZE (50 * BLCKSZ)
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd = 0;
+ int dst_fd = 0;
+ uint16 vm_bits;
+ ssize_t nbytes;
+ char *buffer = NULL;
+ int ret = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return getErrorText(EINVAL);
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
+
+ /* Copy page header data in advance */
+ if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
+ goto err;
+
+ if (write(dst_fd, buffer, nbytes) != nbytes)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ /* perform data rewriting, i.e. read from source, write to destination */
+ while (true)
+ {
+ ssize_t nbytes = read(src_fd, buffer, REWRITE_BUF_SIZE);
+ char *cur, *end;
+
+ if (nbytes < 0)
+ {
+ ret = -1;
+ break;
+ }
+
+ if (nbytes == 0)
+ break;
+
+ cur = buffer;
+ end = buffer + nbytes;
+
+ /* Rewrite each byte and write BITS_PER_HEAPBLOCK bytes to dst_fd */
+ while (end > cur)
+ {
+ /* Look up the rewritten two-bit pattern for this byte in the table */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+
+ if (write(dst_fd, &vm_bits, BITS_PER_HEAPBLOCK) != BITS_PER_HEAPBLOCK)
+ {
+ ret = -1;
+ break;
+ }
+ cur++;
+ }
+ }
+
+err:
+
+ if (buffer)
+ pg_free(buffer);
+
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText(errno);
+}
+
void
check_hard_link(void)
{
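A quick cross-check of rewrite_vm_table (not part of the patch): each old visibility map byte packs eight one-bit all-visible flags, and the rewrite widens every flag to two bits while leaving the new all-frozen bit clear, so the table can be regenerated with a few lines of C. The last entry comes out to 21845 (0x5555) for a fully all-visible byte.

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	int			oldbyte;

	for (oldbyte = 0; oldbyte < 256; oldbyte++)
	{
		uint16_t	newbits = 0;
		int			blk;

		/* the old bit for heap block 'blk' moves to even bit position 2 * blk */
		for (blk = 0; blk < 8; blk++)
			if (oldbyte & (1 << blk))
				newbits |= (uint16_t) (1 << (2 * blk));

		/* the odd (all-frozen) bit positions stay zero after the upgrade */
		printf("%u%s", newbits, (oldbyte % 16 == 15) ? ",\n" : ", ");
	}
	return 0;
}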
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 13aa891..d04d836 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -112,6 +112,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map changed with this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201510191
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -396,6 +400,8 @@ const pageCnvCtx *setupPageConverter(void);
typedef void *pageCnvCtx;
#endif
+const char *copyOrRewriteFile(pageCnvCtx *pageConverter, const char *src,
+ const char *dst, bool force, bool rewrite_vm);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c22df42..2fa5b47 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_rewrite_needed = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,6 +181,13 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_rewrite_needed = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
@@ -195,7 +203,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
* Copy/link any fsm and vm files, if they exist
*/
transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
- if (vm_crashsafe_match)
+ if (vm_crashsafe_match || vm_rewrite_needed)
transfer_relfile(pageConverter, &maps[mapnum], "_vm");
}
}
@@ -218,6 +226,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
int fd;
int segno;
char extent_suffix[65];
+ bool rewrite_vm = false;
/*
* Now copy/link any related segments as well. Remember, PG breaks large
@@ -276,7 +285,15 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (strcmp(type_suffix, "_vm") == 0 &&
+ old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ rewrite_vm = true;
+
+ if ((msg = copyOrRewriteFile(pageConverter, old_file, new_file, true, rewrite_vm)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index aa7f399..6b058d4 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..93afb10 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
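A side note on the SizeOfHeapVisible change above, as a standalone sketch (not backend code; the TransactionId typedef here is a stand-in): anchoring the macro on the new trailing flags member keeps any struct padding out of the logged payload, which grows from 4 to 5 bytes.

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;	/* stand-in for the backend typedef */

typedef struct xl_heap_visible
{
	TransactionId cutoff_xid;
	uint8_t		flags;
} xl_heap_visible;

#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8_t))

int
main(void)
{
	/* 4-byte cutoff_xid plus the 1-byte flags; trailing padding is not logged */
	assert(SizeOfHeapVisible == 5);
	assert(sizeof(xl_heap_visible) >= SizeOfHeapVisible);
	return 0;
}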
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..bacc349 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,22 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern void visibilitymap_count(Relation rel, BlockNumber *all_visible,
+ BlockNumber *all_frozen);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 15e0b97..f2ef868 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201510161
+#define CATALOG_VERSION_NO 201510191
#endif
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 06d287e..e8c1316 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ * up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -96,7 +98,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 31
+#define Natts_pg_class 32
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -108,26 +110,27 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relforcerowsecurity 25
-#define Anum_pg_class_relispopulated 26
-#define Anum_pg_class_relreplident 27
-#define Anum_pg_class_relfrozenxid 28
-#define Anum_pg_class_relminmxid 29
-#define Anum_pg_class_relacl 30
-#define Anum_pg_class_reloptions 31
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relforcerowsecurity 26
+#define Anum_pg_class_relispopulated 27
+#define Anum_pg_class_relreplident 28
+#define Anum_pg_class_relfrozenxid 29
+#define Anum_pg_class_relminmxid 30
+#define Anum_pg_class_relacl 31
+#define Anum_pg_class_reloptions 32
/* ----------------
* initial contents of pg_class
@@ -142,13 +145,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 29 0 t f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 29 0 t f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 32 0 t f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e3a31af..d2bae2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -172,6 +172,7 @@ extern void vac_update_relstats(Relation relation,
BlockNumber num_pages,
double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..1040885 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ * frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -369,6 +371,11 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..0dd5cc1
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,29 @@
+--
+-- Visibility map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+\set VERBOSITY terse
+-- All pages become all-visible
+VACUUM vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- All pages become all-frozen
+VACUUM FREEZE vmtest;
+SELECT relallfrozen = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- All pages are skipped according to VM
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 45 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index c63abf4..1d4cfdb 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -108,5 +108,8 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare without_oid c
# event triggers cannot run concurrently with any test that runs DDL
test: event_trigger
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
+
# run stats by itself because its delay may be insufficient under heavy load
test: stats
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 88dcd64..2ecfe56 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -159,3 +159,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..53d817e
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,20 @@
+--
+-- Visibility map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+\set VERBOSITY terse
+
+-- All pages become all-visible
+VACUUM vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- All pages become all-frozen
+VACUUM FREEZE vmtest;
+SELECT relallfrozen = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- All pages are skipped according to VM
+VACUUM FREEZE VERBOSE vmtest;
+
+DROP TABLE vmtest;
On 9 October 2015 at 15:20, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Oct 8, 2015 at 1:52 PM, Andres Freund <andres@anarazel.de> wrote:
I don't see the problem? I mean catversion will reliably tell you which
format the vm is in?
Totally agreed.
This isn't an agreement competition, it's a cool look at what might cause
problems for all of us.
If we want to avoid bugs in future then we'd better start acting like that
is actually true in practice.
Why should we wave away this concern? Will we wave away a concern next time
you personally raise one? Bruce would have me believe that we added months
onto 9.5 to improve robustness. So let's actually do that. Starting at the
first opportunity.
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-10-20 20:35:31 -0400, Simon Riggs wrote:
On 9 October 2015 at 15:20, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Oct 8, 2015 at 1:52 PM, Andres Freund <andres@anarazel.de> wrote:
I don't see the problem? I mean catversion will reliably tell you which
format the vm is in?
Totally agreed.
This isn't an agreement competition, it's a cool look at what might cause
problems for all of us.
Uh, we form rough consensuses all the time.
If we want to avoid bugs in future then we'd better start acting like that
is actually true in practice.
Why should we wave away this concern? Will we wave away a concern next time
you personally raise one? Bruce would have me believe that we added months
onto 9.5 to improve robustness. So let's actually do that. Starting at the
first opportunity.
Meh. Adding complexity definitely needs to be weighed against the
benefits. As pointed out e.g. by all the multixact issues you mentioned
upthread. In this case your argument for changing the name doesn't seem
to hold much water.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 10/21/15 8:11 AM, Andres Freund wrote:
Meh. Adding complexity definitely needs to be weighed against the
benefits. As pointed out e.g. by all the multixact issues you mentioned
upthread. In this case your argument for changing the name doesn't seem
to hold much water.
ISTM VISIBILITY_MAP_FROZEN_BIT_CAT_VER should be defined in catversion.h
instead of pg_upgrade.h, though, to ensure it's correctly updated when
this gets committed.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Jim Nasby wrote:
On 10/21/15 8:11 AM, Andres Freund wrote:
Meh. Adding complexity definitely needs to be weighed against the
benefits. As pointed out e.g. by all the multixact issues you mentioned
upthread. In this case your argument for changing the name doesn't seem
to hold much water.
ISTM VISIBILITY_MAP_FROZEN_BIT_CAT_VER should be defined in catversion.h
instead of pg_upgrade.h, though, to ensure it's correctly updated when this
gets committed.
That would be untidy and pointless. pg_upgrade.h contains other
catversions.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 21.10.2015 02:05, Masahiko Sawada wrote:
On Sat, Oct 10, 2015 at 4:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Oct 8, 2015 at 1:52 PM, Andres Freund <andres@anarazel.de> wrote:
I don't see the problem? I mean catversion will reliably tell you which format the vm is in?
Totally agreed.
We could additionally use the opportunity to add a metapage, but that seems like an independent thing.
I agree with that, too.
The attached updated v18 patch fixes some bugs.
Please review the patch.
I've just checked the comments:
File: /doc/src/sgml/catalogs.sgml
+ Number of pages that are marked all-frozen in the tables's
Should be:
+ Number of pages that are marked all-frozen in the tables
+ <command>ANALYZE</command>, and a few DDL coomand such as
Should be:
+ <command>ANALYZE</command>, and a few DDL command such as
File: doc/src/sgml/maintenance.sgml
+ When the all pages of table are eventually marked as frozen by
<command>VACUUM</>,
Should be:
+ When all pages of the table are eventually marked as frozen by
<command>VACUUM</>,
File: /src/backend/access/heap/visibilitymap.c
+ * visibility map bit. Then, we lock the buffer. But this creates a race
Should be:
+ * visibility map bit. Than we lock the buffer. But this creates a race
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that
happens,
Should be:
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that
happens,
(Remove duplicate white space before if)
Please note I'm not a native speaker. There is a good chance that I am
wrong ;)
Greetings,
Torsten
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Oct 22, 2015 at 4:11 PM, Torsten Zühlsdorff
<mailinglists@toco-domains.de> wrote:
On 21.10.2015 02:05, Masahiko Sawada wrote:
On Sat, Oct 10, 2015 at 4:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Oct 8, 2015 at 1:52 PM, Andres Freund <andres@anarazel.de> wrote:
I don't see the problem? I mean catversion will reliably tell you which
format the vm is in?
Totally agreed.
We could additionally use the opportunity to add a metapage, but that
seems like an independent thing.
I agree with that, too.
The attached updated v18 patch fixes some bugs.
Please review the patch.
I've just checked the comments:
Thank you for taking the time to review this patch.
Attached is the updated patch (v19).
File: /doc/src/sgml/catalogs.sgml
+ Number of pages that are marked all-frozen in the tables's
Should be:
+ Number of pages that are marked all-frozen in the tables
I changed it as follows.
+ Number of pages that are marked all-frozen in the table's
A similar sentence exists for relallvisible.
+ <command>ANALYZE</command>, and a few DDL coomand such as
Should be:
+ <command>ANALYZE</command>, and a few DDL command such as
Fixed.
File: doc/src/sgml/maintenance.sgml
+ When the all pages of table are eventually marked as frozen by <command>VACUUM</>,
Should be:
+ When all pages of the table are eventually marked as frozen by <command>VACUUM</>,
Fixed.
File: /src/backend/access/heap/visibilitymap.c
+ * visibility map bit. Then, we lock the buffer. But this creates a race
Should be:
+ * visibility map bit. Than we lock the buffer. But this creates a race
Actually, I didn't change this sentence, so I kept it.
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
Should be:
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
(Remove duplicate white space before if)
The other sentences seem to have a double space after the period,
so I kept it.
Please review it.
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v19.patchapplication/octet-stream; name=000_add_frozen_bit_into_visibilitymap_v19.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 97ef618..2ea3e78 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1841,6 +1841,18 @@
</row>
<row>
+ <entry><structfield>relallfrozen</structfield></entry>
+ <entry><type>int4</type></entry>
+ <entry></entry>
+ <entry>
+ Number of pages that are marked all-frozen in the table's
+ visibility map. It is updated by <command>VACUUM</command>,
+ <command>ANALYZE</command>, and a few DDL commands such as
+ <command>CREATE INDEX</command>.
+ </entry>
+ </row>
+
+ <row>
<entry><structfield>reltoastrelid</structfield></entry>
<entry><type>oid</type></entry>
<entry><literal><link linkend="catalog-pg-class"><structname>pg_class</structname></link>.oid</literal></entry>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5549de7..bb63bb9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5900,7 +5900,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs aggressive freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -5944,7 +5944,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs aggressive freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index b5d4050..9183aba 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and of which pages contain only tuples that are
+ marked as frozen. This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -438,23 +438,22 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows.
+ To ensure all old row versions have been frozen, a scan of all pages that
+ are not marked as frozen is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: a table sweep is forced if
+ the table hasn't had all of its row versions guaranteed frozen for
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transactions.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
+ the time <command>VACUUM</> last scanned pages that are not marked as frozen.
+ If it were to go unvacuumed for longer than
that, data loss could result. To ensure that this does not happen,
autovacuum is invoked on any table that might contain unfrozen rows with
XIDs older than the age specified by the configuration parameter <xref
@@ -490,8 +489,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +525,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +553,19 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. The whole table is scanned only when all pages happen to require
+ vacuuming to remove dead row versions. In other cases, such as when
+ <structfield>relfrozenxid</> is more than <varname>vacuum_freeze_table_age</>
+ transactions old or when <command>VACUUM</>'s <literal>FREEZE</> option is used,
+ <command>VACUUM</> can skip pages on which all tuples are already
+ marked as frozen.
+ When all pages of the table are eventually marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transactions started since the <command>VACUUM</> started).
+ If <structfield>relfrozenxid</> has not been advanced by the time
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,28 +642,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ <command>VACUUM</> scans of all pages that are not marked as frozen,
+ regardless of what causes them, enable advancing the value for that table.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. A
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +743,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..9328cdf 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -623,18 +623,21 @@ can be used to examine the information stored in free space maps.
<para>
Each heap relation has a Visibility Map
(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
+visible to all active transactions, and of which pages contain only tuples
+that are marked as frozen.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_vm</> suffix.
+For example, if the filenode of a relation is 12345, the VM is stored in a file
+called <filename>12345_vm</>, in the same directory as the main relation file.
Note that indexes do not have VMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The visibility map stores two bits per heap page: all-visible and all-frozen.
+A set all-visible bit means that all tuples on the page are known to be visible
+to all transactions. A set all-frozen bit means that all tuples on the page are
+completely marked as frozen. This means that the page does not contain any tuples
+that need to be vacuumed, even if a scan of the whole table is required.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
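One consequence of the two-bit format that may be worth spelling out somewhere: a single visibility map page now covers half as many heap blocks as before. A back-of-the-envelope sketch, assuming the default 8 kB block size and a 24-byte MAXALIGN'd page header:

#include <stdio.h>

int
main(void)
{
	const int	blcksz = 8192;
	const int	page_header = 24;	/* assumed MAXALIGN(SizeOfPageHeaderData) */
	const int	mapsize = blcksz - page_header;
	const long	old_blocks = (long) mapsize * 8;	/* 1 bit per heap block */
	const long	new_blocks = (long) mapsize * 4;	/* 2 bits per heap block */

	printf("old format: %ld heap blocks (~%ld MB of heap) per VM page\n",
		   old_blocks, old_blocks * blcksz / (1024 * 1024));
	printf("new format: %ld heap blocks (~%ld MB of heap) per VM page\n",
		   new_blocks, new_blocks * blcksz / (1024 * 1024));
	return 0;
}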
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 66deb1f..e70f110 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2400,8 +2400,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map
+ * page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -3000,9 +3001,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -7166,7 +7167,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -7176,6 +7177,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7765,7 +7767,12 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7817,7 +7824,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..dc4d582 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,33 +21,41 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if a whole-table-scanning vacuum is required (e.g. an anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible.
+ * That is, the all-frozen bit is always set together with the all-visible bit.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. The all-frozen bit must be
+ * cleared together with the all-visible bit.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is all-visible or
+ * all-frozen; we just don't know that for certain. The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility map
+ * bit. If a crash occurs after the visibility map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
+ * The visibility map has an all-frozen bit, which indicates that all tuples
+ * on the corresponding page have been completely frozen, so the visibility map
+ * is also used by anti-wraparound vacuums, even though they need to freeze tuples.
*
* LOCKING
*
@@ -58,14 +66,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +109,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +126,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set all-visible and all-frozen bits */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +170,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,11 +182,12 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) <<
+ (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s block %d", RelationGetRelationName(rel), heapBlk);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -225,7 +255,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +264,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and the flags indicating which bits we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +276,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,11 +286,13 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
Assert(InRecovery || BufferIsValid(heapBuf));
+ Assert((flags & VISIBILITYMAP_ALL_VISIBLE) ||
+ (flags & (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN)));
/* Check that we have the right heap page pinned, if present */
if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -272,11 +306,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +319,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +329,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(((flags & VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
+ ((flags & VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +349,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test if bit(s) is set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible to all (or all frozen), according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits. The caller must
+ * pass flags indicating which bits to test.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -328,7 +368,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +377,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +400,12 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single or double bit read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ?
+ true : false;
return result;
}
@@ -374,14 +415,16 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * The all-visible and all-frozen page counts are returned via the output arguments.
*/
-BlockNumber
-visibilitymap_count(Relation rel)
+void
+visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)
{
- BlockNumber result = 0;
BlockNumber mapBlock;
+ *all_visible = *all_frozen = 0;
+
for (mapBlock = 0;; mapBlock++)
{
Buffer mapBuffer;
@@ -406,13 +449,12 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ *all_visible += number_of_ones_for_visible[map[i]];
+ *all_frozen += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
}
-
- return result;
}
/*
@@ -435,7 +477,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ elog(DEBUG1, "vm_truncate %s block %d", RelationGetRelationName(rel), nheapblocks);
#endif
RelationOpenSmgr(rel);
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 7d7d062..c78bebe 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -790,6 +790,7 @@ InsertPgClassTuple(Relation pg_class_desc,
values[Anum_pg_class_relpages - 1] = Int32GetDatum(rd_rel->relpages);
values[Anum_pg_class_reltuples - 1] = Float4GetDatum(rd_rel->reltuples);
values[Anum_pg_class_relallvisible - 1] = Int32GetDatum(rd_rel->relallvisible);
+ values[Anum_pg_class_relallfrozen - 1] = Int32GetDatum(rd_rel->relallfrozen);
values[Anum_pg_class_reltoastrelid - 1] = ObjectIdGetDatum(rd_rel->reltoastrelid);
values[Anum_pg_class_relhasindex - 1] = BoolGetDatum(rd_rel->relhasindex);
values[Anum_pg_class_relisshared - 1] = BoolGetDatum(rd_rel->relisshared);
@@ -869,18 +870,21 @@ AddNewRelationTuple(Relation pg_class_desc,
new_rel_reltup->relpages = 0;
new_rel_reltup->reltuples = 0;
new_rel_reltup->relallvisible = 0;
+ new_rel_reltup->relallfrozen = 0;
break;
case RELKIND_SEQUENCE:
/* Sequences always have a known size */
new_rel_reltup->relpages = 1;
new_rel_reltup->reltuples = 1;
new_rel_reltup->relallvisible = 0;
+ new_rel_reltup->relallfrozen = 0;
break;
default:
/* Views, etc, have no disk storage */
new_rel_reltup->relpages = 0;
new_rel_reltup->reltuples = 0;
new_rel_reltup->relallvisible = 0;
+ new_rel_reltup->relallfrozen = 0;
break;
}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..7753f66 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1813,8 +1813,8 @@ FormIndexDatum(IndexInfo *indexInfo,
* isprimary: if true, set relhaspkey true; else no change
* reltuples: if >= 0, set reltuples to this value; else no change
*
- * If reltuples >= 0, relpages and relallvisible are also updated (using
- * RelationGetNumberOfBlocks() and visibilitymap_count()).
+ * If reltuples >= 0, relpages, relallvisible and relallfrozen are also updated
+ * (using RelationGetNumberOfBlocks() and visibilitymap_count()).
*
* NOTE: an important side-effect of this operation is that an SI invalidation
* message is sent out to all backends --- including me --- causing relcache
@@ -1859,8 +1859,8 @@ index_update_stats(Relation rel,
* true is safe even if there are no indexes (VACUUM will eventually fix
* it), likewise for relhaspkey. And of course the new relpages and
* reltuples counts are correct regardless. However, we don't want to
- * change relpages (or relallvisible) if the caller isn't providing an
- * updated reltuples count, because that would bollix the
+ * change relpages (or relallvisible/relallfrozen) if the caller isn't
+ * providing an updated reltuples count, because that would bollix the
* reltuples/relpages ratio which is what's really important.
*/
@@ -1918,12 +1918,15 @@ index_update_stats(Relation rel,
if (reltuples >= 0)
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
- BlockNumber relallvisible;
+ BlockNumber relallvisible, relallfrozen;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ visibilitymap_count(rel, &relallvisible, &relallfrozen);
else /* don't bother for indexes */
+ {
relallvisible = 0;
+ relallfrozen = 0;
+ }
if (rd_rel->relpages != (int32) relpages)
{
@@ -1940,6 +1943,11 @@ index_update_stats(Relation rel,
rd_rel->relallvisible = (int32) relallvisible;
dirty = true;
}
+ if (rd_rel->relallfrozen != (int32) relallfrozen)
+ {
+ rd_rel->relallfrozen = (int32) relallfrozen;
+ dirty = true;
+ }
}
/*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ddb68ab..681b4a9 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -314,6 +314,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ BlockNumber relallvisible,
+ relallfrozen;
if (inh)
ereport(elevel,
@@ -564,6 +566,10 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
}
}
+ /* Calculate the number of all-visible and all-frozen bit */
+ if (!inh)
+ visibilitymap_count(onerel, &relallvisible, &relallfrozen);
+
/*
* Update pages/tuples stats in pg_class ... but not if we're doing
* inherited stats.
@@ -572,7 +578,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ relallvisible,
+ relallfrozen,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -595,6 +602,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ab4874..beb0ecf 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1266,6 +1266,7 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
int32 swap_pages;
float4 swap_tuples;
int32 swap_allvisible;
+ int32 swap_allfrozen;
swap_pages = relform1->relpages;
relform1->relpages = relform2->relpages;
@@ -1278,6 +1279,10 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
swap_allvisible = relform1->relallvisible;
relform1->relallvisible = relform2->relallvisible;
relform2->relallvisible = swap_allvisible;
+
+ swap_allfrozen = relform1->relallfrozen;
+ relform1->relallfrozen = relform2->relallfrozen;
+ relform2->relallfrozen = swap_allfrozen;
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 6d55148..19b768d 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -729,11 +729,11 @@ vac_estimate_reltuples(Relation relation, bool is_analyze,
* marked with xmin = our xid.
*
* In addition to fundamentally nontransactional statistics such as
- * relpages and relallvisible, we try to maintain certain lazily-updated
- * DDL flags such as relhasindex, by clearing them if no longer correct.
- * It's safe to do this in VACUUM, which can't run in parallel with
- * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
- * However, it's *not* safe to do it in an ANALYZE that's within an
+ * relpages, relallvisible and relallfrozen, we try to maintain certain
+ * lazily-updated DDL flags such as relhasindex, by clearing them if no
+ * longer correct. It's safe to do this in VACUUM, which can't run in
+ * parallel with CREATE INDEX/RULE/TRIGGER and can't be part of a transaction
+ * block. However, it's *not* safe to do it in an ANALYZE that's within an
* outer transaction, because for example the current transaction might
* have dropped the last index; then we'd think relhasindex should be
* cleared, but if the transaction later rolls back this would be wrong.
@@ -747,6 +747,7 @@ void
vac_update_relstats(Relation relation,
BlockNumber num_pages, double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
bool in_outer_xact)
@@ -784,6 +785,11 @@ vac_update_relstats(Relation relation,
pgcform->relallvisible = (int32) num_all_visible_pages;
dirty = true;
}
+ if (pgcform->relallfrozen != (int32) num_all_frozen_pages)
+ {
+ pgcform->relallfrozen = (int32) num_all_frozen_pages;
+ dirty = true;
+ }
/* Apply DDL updates, but not inside an outer transaction (see above) */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..e31597f 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped by all-frozen bit
+ of visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we may still skip some pages according to
+ * the all-frozen bit of the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +258,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -286,9 +292,9 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* density") with nonzero relpages and reltuples=0 (which means "zero
* tuple density") unless there's some actual evidence for the latter.
*
- * We do update relallvisible even in the corner case, since if the table
- * is all-visible we'd definitely like to know that. But clamp the value
- * to be not more than what we're setting relpages to.
+ * We do update relallvisible and relallfrozen even in the corner case,
+ * since if the table is all-visible we'd definitely like to know that.
+ * But clamp the value to be not more than what we're setting relpages to.
*
* Also, don't change relfrozenxid/relminmxid if we skipped any pages,
* since then we don't know for certain that all tuples have a newer xmin.
@@ -301,10 +307,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ visibilitymap_count(onerel, &new_rel_allvisible, &new_rel_allfrozen);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -312,6 +321,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_pages,
new_rel_tuples,
new_rel_allvisible,
+ new_rel_allfrozen,
vacrelstats->hasindex,
new_frozen_xid,
new_min_multi,
@@ -360,10 +370,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +497,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to all-visible bit of
+ * visibility map means that we can't update relfrozenxid, so we only want
+ * to do it if we can skip a goodly number. On the other hand, we count
+ * both how many pages we skipped according to all-frozen bit of visibility
+ * map and how many pages we freeze page, so we can update relfrozenxid if
+ * the sum of them is as many as pages of table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -498,24 +512,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we might not actually skip any pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
* Note: The value returned by visibilitymap_test could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*/
for (next_not_all_visible_block = 0;
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE))
break;
vacuum_delay_point();
}
@@ -533,9 +548,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples we freeze on this page */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -548,7 +567,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE))
break;
vacuum_delay_point();
}
@@ -563,13 +583,30 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
skipping_all_visible_blocks = false;
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to visibility map.
+ * We check whehter this block is all-frozen or not, to skip to
+ * vacuum this page even if scan_all is true.
+ */
+ bool all_frozen = visibilitymap_test(onerel, blkno, &vmbuffer,
+ VISIBILITYMAP_ALL_FROZEN);
+ if (scan_all)
+ {
+ if (all_frozen)
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ }
+ else if (skipping_all_visible_blocks)
+ continue;
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -740,7 +777,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +802,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +958,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +976,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
+ * We must log the changes to be crash-safe against future truncation
+ * of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1012,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1037,46 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,7 +1087,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
@@ -1078,7 +1147,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map and frozen map page.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1114,6 +1183,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to visibility map",
+ "skipped %d frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages,
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1226,6 +1302,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1277,19 +1354,34 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set the VM all-frozen bit and flag.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
+ uint8 flags = 0;
+
+ if (!visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* Set the VM all-frozen bit to flag, if needed */
+ if (all_frozen &&
+ !visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+
+ if (flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1408,6 +1500,7 @@ lazy_cleanup_index(Relation indrel,
stats->num_pages,
stats->num_index_tuples,
0,
+ 0,
false,
InvalidTransactionId,
InvalidMultiXactId,
@@ -1779,10 +1872,12 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and all_frozen which implies that all tuples
+ * of this page are frozen.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1791,6 +1886,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1814,11 +1910,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1855,6 +1952,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1863,6 +1964,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
@@ -1871,5 +1973,8 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
} /* scan along page */
+ if (!all_visible)
+ *all_frozen = false;
+
return all_visible;
}
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/rewrite/rewriteDefine.c b/src/backend/rewrite/rewriteDefine.c
index 39c83a6..560cf5a 100644
--- a/src/backend/rewrite/rewriteDefine.c
+++ b/src/backend/rewrite/rewriteDefine.c
@@ -604,6 +604,7 @@ DefineQueryRewrite(char *rulename,
classForm->relpages = 0;
classForm->reltuples = 0;
classForm->relallvisible = 0;
+ classForm->relallfrozen = 0;
classForm->reltoastrelid = InvalidOid;
classForm->relhasindex = false;
classForm->relkind = RELKIND_VIEW;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 9c3d096..8aa8470 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1613,6 +1613,7 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_rel->relpages = 0;
relation->rd_rel->reltuples = 0;
relation->rd_rel->relallvisible = 0;
+ relation->rd_rel->relallfrozen = 0;
relation->rd_rel->relkind = RELKIND_RELATION;
relation->rd_rel->relhasoids = hasoids;
relation->rd_rel->relnatts = (int16) natts;
@@ -3114,6 +3115,7 @@ RelationSetNewRelfilenode(Relation relation, char persistence,
classform->relpages = 0; /* it's empty until further notice */
classform->reltuples = 0;
classform->relallvisible = 0;
+ classform->relallfrozen = 0;
}
classform->relfrozenxid = freezeXid;
classform->relminmxid = minmulti;
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 41d4606..a3ce324 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -231,6 +231,15 @@ check_cluster_versions(void)
if (old_cluster.major_version > new_cluster.major_version)
pg_fatal("This utility cannot be used to downgrade to older major PostgreSQL versions.\n");
+ /*
+ * We cant't allow upgrading with link mode between 9.5 or before and 9.6 or later,
+ * because the format of visibility map has been changed on version 9.6.
+ */
+ if (user_opts.transfer_mode == TRANSFER_MODE_LINK &&
+ GET_MAJOR_VERSION(old_cluster.major_version) < 906 &&
+ GET_MAJOR_VERSION(new_cluster.major_version) >= 906)
+ pg_fatal("This utility cannot upgrade from PostgreSQL version from 9.5 or before to 9.6 or later with link mode.\n");
+
/* get old and new binary versions */
get_bin_version(&old_cluster);
get_bin_version(&new_cluster);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 79d9390..d47a98b 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
#include <fcntl.h>
@@ -21,6 +22,43 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static const char *rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool force);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
+
+/*
+ * copyOrRewriteFile()
+ * This function copies file or rewrite visibility map file.
+ * If rewrite_vm is true, we have to rewrite visibility map regardless value of pageConverter.
+ */
+const char *
+copyOrRewriteFile(pageCnvCtx *pageConverter,
+ const char *src, const char *dst, bool force, bool rewrite_vm)
+{
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, force);
+ else
+ return copyAndUpdateFile(pageConverter, src, dst, force);
+}
/*
* copyAndUpdateFile()
@@ -201,6 +239,97 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilitymap()
+ *
+ * An additional bit indicating that all tuples on the page are completely
+ * frozen was added to the visibility map in PG 9.6, so the format of the
+ * visibility map has changed.
+ * Copies a visibility map file while adding an all-frozen bit (initially 0) for each block.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+#define REWRITE_BUF_SIZE (50 * BLCKSZ)
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd = 0;
+ int dst_fd = 0;
+ uint16 vm_bits;
+ ssize_t nbytes;
+ char *buffer = NULL;
+ int ret = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return getErrorText(EINVAL);
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
+
+ /* Copy page header data in advance */
+ if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
+ goto err;
+
+ if (write(dst_fd, buffer, nbytes) != nbytes)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ /* perform data rewriting, i.e. read from source, write to destination */
+ while (true)
+ {
+ ssize_t nbytes = read(src_fd, buffer, REWRITE_BUF_SIZE);
+ char *cur, *end;
+
+ if (nbytes < 0)
+ {
+ ret = -1;
+ break;
+ }
+
+ if (nbytes == 0)
+ break;
+
+ cur = buffer;
+ end = buffer + nbytes;
+
+ /* Rewrite a byte and write dest_fd per BITS_PER_HEAPBLOCK bytes */
+ while (end > cur)
+ {
+ /* Get rewritten bit from table and its string representation */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+
+ if (write(dst_fd, &vm_bits, BITS_PER_HEAPBLOCK) != BITS_PER_HEAPBLOCK)
+ {
+ ret = -1;
+ break;
+ }
+ cur++;
+ }
+ }
+
+err:
+
+ if (buffer)
+ pg_free(buffer);
+
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText(errno);
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 13aa891..d04d836 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -112,6 +112,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of visibility map changed with this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201510191
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -396,6 +400,8 @@ const pageCnvCtx *setupPageConverter(void);
typedef void *pageCnvCtx;
#endif
+const char *copyOrRewriteFile(pageCnvCtx *pageConverter, const char *src,
+ const char *dst, bool force, bool rewrite_vm);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c22df42..2fa5b47 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_rewrite_needed = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,6 +181,13 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_rewrite_needed = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
@@ -195,7 +203,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
* Copy/link any fsm and vm files, if they exist
*/
transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
- if (vm_crashsafe_match)
+ if (vm_crashsafe_match || vm_rewrite_needed)
transfer_relfile(pageConverter, &maps[mapnum], "_vm");
}
}
@@ -218,6 +226,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
int fd;
int segno;
char extent_suffix[65];
+ bool rewrite_vm = false;
/*
* Now copy/link any related segments as well. Remember, PG breaks large
@@ -276,7 +285,15 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (strcmp(type_suffix, "_vm") == 0 &&
+ old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ rewrite_vm = true;
+
+ if ((msg = copyOrRewriteFile(pageConverter, old_file, new_file, true, rewrite_vm)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index aa7f399..6b058d4 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..93afb10 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..bacc349 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,22 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern void visibilitymap_count(Relation rel, BlockNumber *all_visible,
+ BlockNumber *all_frozen);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 15e0b97..f2ef868 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201510161
+#define CATALOG_VERSION_NO 201510191
#endif
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 06d287e..e8c1316 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -47,6 +47,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
float4 reltuples; /* # of tuples (not always up-to-date) */
int32 relallvisible; /* # of all-visible blocks (not always
* up-to-date) */
+ int32 relallfrozen; /* # of all-frozen blocks (not always
+ up-to-date) */
Oid reltoastrelid; /* OID of toast table; 0 if none */
bool relhasindex; /* T if has (or has had) any indexes */
bool relisshared; /* T if shared across databases */
@@ -96,7 +98,7 @@ typedef FormData_pg_class *Form_pg_class;
* ----------------
*/
-#define Natts_pg_class 31
+#define Natts_pg_class 32
#define Anum_pg_class_relname 1
#define Anum_pg_class_relnamespace 2
#define Anum_pg_class_reltype 3
@@ -108,26 +110,27 @@ typedef FormData_pg_class *Form_pg_class;
#define Anum_pg_class_relpages 9
#define Anum_pg_class_reltuples 10
#define Anum_pg_class_relallvisible 11
-#define Anum_pg_class_reltoastrelid 12
-#define Anum_pg_class_relhasindex 13
-#define Anum_pg_class_relisshared 14
-#define Anum_pg_class_relpersistence 15
-#define Anum_pg_class_relkind 16
-#define Anum_pg_class_relnatts 17
-#define Anum_pg_class_relchecks 18
-#define Anum_pg_class_relhasoids 19
-#define Anum_pg_class_relhaspkey 20
-#define Anum_pg_class_relhasrules 21
-#define Anum_pg_class_relhastriggers 22
-#define Anum_pg_class_relhassubclass 23
-#define Anum_pg_class_relrowsecurity 24
-#define Anum_pg_class_relforcerowsecurity 25
-#define Anum_pg_class_relispopulated 26
-#define Anum_pg_class_relreplident 27
-#define Anum_pg_class_relfrozenxid 28
-#define Anum_pg_class_relminmxid 29
-#define Anum_pg_class_relacl 30
-#define Anum_pg_class_reloptions 31
+#define Anum_pg_class_relallfrozen 12
+#define Anum_pg_class_reltoastrelid 13
+#define Anum_pg_class_relhasindex 14
+#define Anum_pg_class_relisshared 15
+#define Anum_pg_class_relpersistence 16
+#define Anum_pg_class_relkind 17
+#define Anum_pg_class_relnatts 18
+#define Anum_pg_class_relchecks 19
+#define Anum_pg_class_relhasoids 20
+#define Anum_pg_class_relhaspkey 21
+#define Anum_pg_class_relhasrules 22
+#define Anum_pg_class_relhastriggers 23
+#define Anum_pg_class_relhassubclass 24
+#define Anum_pg_class_relrowsecurity 25
+#define Anum_pg_class_relforcerowsecurity 26
+#define Anum_pg_class_relispopulated 27
+#define Anum_pg_class_relreplident 28
+#define Anum_pg_class_relfrozenxid 29
+#define Anum_pg_class_relminmxid 30
+#define Anum_pg_class_relacl 31
+#define Anum_pg_class_reloptions 32
/* ----------------
* initial contents of pg_class
@@ -142,13 +145,13 @@ typedef FormData_pg_class *Form_pg_class;
* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId;
* similarly, "1" in relminmxid stands for FirstMultiXactId
*/
-DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1247 ( pg_type PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 f f p r 30 0 t f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1249 ( pg_attribute PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 f f p r 21 0 f f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 29 0 t f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1255 ( pg_proc PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 f f p r 29 0 t f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
-DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 31 0 t f f f f f f t n 3 1 _null_ _null_ ));
+DATA(insert OID = 1259 ( pg_class PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 f f p r 32 0 t f f f f f f t n 3 1 _null_ _null_ ));
DESCR("");
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e3a31af..d2bae2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -172,6 +172,7 @@ extern void vac_update_relstats(Relation relation,
BlockNumber num_pages,
double num_tuples,
BlockNumber num_all_visible_pages,
+ BlockNumber num_all_frozen_pages,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..1040885 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -369,6 +371,11 @@ typedef PageHeaderData *PageHeader;
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+
#define PageIsPrunable(page, oldestxmin) \
( \
AssertMacro(TransactionIdIsNormal(oldestxmin)), \
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..0dd5cc1
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,29 @@
+--
+-- Visibility map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+\set VERBOSITY terse
+-- All pages become all-visible
+VACUUM vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- All pages become all-frozen
+VACUUM FREEZE vmtest;
+SELECT relallfrozen = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- All pages are skipped according to VM
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 45 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index c63abf4..1d4cfdb 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -108,5 +108,8 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare without_oid c
# event triggers cannot run concurrently with any test that runs DDL
test: event_trigger
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
+
# run stats by itself because its delay may be insufficient under heavy load
test: stats
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 88dcd64..2ecfe56 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -159,3 +159,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..53d817e
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,20 @@
+--
+-- Visibility map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+\set VERBOSITY terse
+
+-- All pages become all-visible
+VACUUM vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- All pages become all-frozen
+VACUUM FREEZE vmtest;
+SELECT relallfrozen = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- All pages are skipped according to VM
+VACUUM FREEZE VERBOSE vmtest;
+
+DROP TABLE vmtest;
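
A side note on the rewrite_vm_table in the pg_upgrade changes above: the table
simply spreads each old 1-bit-per-heap-block byte into the new
2-bits-per-heap-block layout, leaving every all-frozen bit cleared. A minimal
standalone sketch (not part of the patch, just an illustration of how the
table can be regenerated):

#include <stdint.h>
#include <stdio.h>

/*
 * Spread one old visibility map byte (8 heap blocks x 1 bit) into the new
 * format (8 heap blocks x 2 bits).  Bit i of the old byte becomes bit 2*i
 * of the result (the all-visible bit); the all-frozen bits stay cleared.
 */
static uint16_t
expand_vm_byte(uint8_t old_byte)
{
	uint16_t	new_bits = 0;
	int			i;

	for (i = 0; i < 8; i++)
	{
		if (old_byte & (1 << i))
			new_bits |= (uint16_t) 1 << (2 * i);
	}
	return new_bits;
}

int
main(void)
{
	/* e.g. 0xFF (all eight blocks all-visible) maps to 21845 (0x5555) */
	printf("%u\n", (unsigned) expand_vm_byte(0xFF));
	return 0;
}

Each entry of rewrite_vm_table equals expand_vm_byte(index), which is why a
plain byte-indexed lookup is enough while rewriting the _vm fork during the
upgrade.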
On Mon, Oct 5, 2015 at 9:53 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Oct 5, 2015 at 11:03 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Oct 2, 2015 at 8:14 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
+#define Anum_pg_class_relallfrozen 12

Why is pg_class.relallfrozen necessary? ISTM that there is no user of
it now.

The relallfrozen would be useful for users to estimate the time needed for a
VACUUM FREEZE or anti-wraparound vacuum before actually running it.
(This value is also used in the regression test.)
But this information is not used for planning like relallvisible is, so it
would be good to move it to another system view like pg_stat_*_tables.

Or make pgstattuple and pgstattuple_approx report even the number
of frozen tuples?

But we cannot know the number of frozen pages without installing the
pageinspect module.
I'm a bit concerned that not all projects can install extension modules
into PostgreSQL in a production environment.
I think we need to provide such a feature at least in core.
I think we can display information about relallfrozen in pg_stat_*_tables
as suggested by you. It doesn't make much sense to keep it in pg_class
unless we have some use case for the same.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Oct 24, 2015 at 10:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think we can display information about relallfrozen in pg_stat_*_tables
as suggested by you. It doesn't make much sense to keep it in pg_class
unless we have some use case for the same.
I'm thinking a bit about implementing a read-only table feature that
restricts UPDATE/DELETE and ensures that the whole table is frozen,
if this feature is committed.
The value of relallfrozen might be useful for such a feature.
Regards,
--
Masahiko Sawada
On Sat, Oct 24, 2015 at 2:24 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I'm thinking a bit about implementing a read-only table feature that
restricts UPDATE/DELETE and ensures that the whole table is frozen,
if this feature is committed.
The value of relallfrozen might be useful for such a feature.
If we need this for the read-only table feature, then it's better to add it
after discussing the design of that feature. It doesn't seem advisable to
have an extra field in a system table for a feature that has not yet been
completely discussed.
Review Comments:
-------------------------------
1.
 /*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
  */
 buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
                                    InvalidBuffer, options, bistate,

I think it is sufficient to say in the end 'visibility map page'.
Let's not include 'frozen map page'.
2.
+ * corresponding page has been completely frozen, so the visibility map is also
+ * used for anti-wraparound vacuum, even if freezing tuples is required.

/all tuple/all tuples
/freezing tuples/freezing of tuples
3.
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?

I think it is better to modify the above statement as:
Are all tuples on heapBlk visible to all or are marked as frozen, according
to the visibility map?
4.
+ * releasing *buf after it's done testing and setting bits, and must set flags
+ * which indicates what flag we want to test.

Here, are you talking about the flags passed to visibilitymap_set()? If yes,
then the above comment is not clear; how about:
"and must pass flags for which it needs to check the value in visibility map."
5.
+ * both how many pages we skipped according to all-frozen bit of visibility
+ * map and how many pages we freeze page, so we can update relfrozenxid if

In the above sentence the word 'page' after freeze sounds redundant.
/we freeze page/we freeze
Another suggestion:
/sum of them/sum of two
6.
+ * This block is at least all-visible according to visibility map.
+ * We check whehter this block is all-frozen or not, to skip to

"whether" is mis-spelled.
7.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.

Here, I think a WAL record is written only when we mark some tuple(s) as
frozen, not if they are already frozen, so in that regard, I think the
above comment is wrong.
8.
+ /*
+ * We cant't allow upgrading with link mode between 9.5 or before and 9.6 or later,
+ * because the format of visibility map has been changed on version 9.6.
+ */

a. /cant't/can't
b. /changed on version 9.6/changed in version 9.6
c. Won't such a change need to be updated in the pg_upgrade
documentation (Notes section)?
9.
@@ -180,6 +181,13 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
 new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
 vm_crashsafe_match = false;
+
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_rewrite_needed = true;
..
@@ -276,7 +285,15 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
 {
 pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (strcmp(type_suffix, "_vm") == 0 &&
+ old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ rewrite_vm = true;
Instead of doing the re-check in transfer_relfile(), I think it is better
to pass an additional parameter to this function.
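Something along these lines, perhaps (a hypothetical sketch of the
suggestion, not code from the patch; the extra rewrite_vm parameter of
transfer_relfile() is an assumption for illustration):

/* Decide once, in the caller, whether the _vm fork must be rewritten ... */
bool		rewrite_vm =
	old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
	new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER;

/* ... and hand the decision down instead of re-deriving it inside. */
transfer_relfile(pageConverter, map, type_suffix, rewrite_vm);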
10.
You have mentioned up-thread that you have changed the patch so that
PageClearAllVisible clears both bits; can you please point me to this
change?
Basically, after applying the patch, I see the below code in bufpage.h:
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
Don't we need to clear the PD_ALL_FROZEN separately?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Oct 28, 2015 at 12:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Oct 24, 2015 at 2:24 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Oct 24, 2015 at 10:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think we can display information about relallfrozen in pg_stat_*_tables
as suggested by you. It doesn't make much sense to keep it in pg_class
unless we have some use case for the same.
I'm thinking a bit about implementing a read-only table that is
restricted from UPDATE/DELETE and is guaranteed to be entirely frozen,
if this feature is committed.
The value of relallfrozen might be useful for such a feature.
Thank you for reviewing!
If we need this for the read-only table feature, then it is better to add it
after discussing the design of that feature. It doesn't seem
advisable to have an extra field in a system table that we might only
need for a feature that has not yet been fully discussed.
I changed it so that the number of frozen pages is stored in
pg_stat_all_tables as statistics information.
Also, tests related to counting the all-visible bit and skipping
vacuum have been added to the visibility map test, and a test related to
counting all-frozen pages has been added to the stats collector test.
Attached is the updated v20 patch.
Review Comments:
-------------------------------
1.
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
I think it is sufficient to say just 'visibility map page' at the end.
Let's not include 'frozen map page'.
Fixed.
2.
+ * corresponding page has been completely frozen, so the visibility map is also
+ * used for anti-wraparound vacuum, even if freezing tuples is required.
/all tuple/all tuples
/freezing tuples/freezing of tuples
Fixed.
3.
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?
I think it is better to modify the above statement as:
Are all tuples on heapBlk visible to all or are marked as frozen, according
to the visibility map?
Fixed.
4.
+ * releasing *buf after it's done testing and setting bits, and must set flags
+ * which indicates what flag we want to test.
Are you talking here about the flags passed to visibilitymap_set()? If
yes, then the above comment is not clear; how about:
and must pass flags
for which it needs to check the value in the visibility map.
Fixed.
5.
+ * both how many pages we skipped according to all-frozen bit of visibility
+ * map and how many pages we freeze page, so we can update relfrozenxid if
In the above sentence the word 'page' after 'freeze' sounds redundant.
/we freeze page/we freeze
Another suggestion:
/sum of them/sum of two
Fixed.
6.
+ * This block is at least all-visible according to visibility map.
+ * We check whehter this block is all-frozen or not, to skip to
"whether" is mis-spelled.
Fixed.
7.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.
Here, I think the WAL record is written only when we mark some
tuple(s) as frozen, not if they are already frozen,
so in that regard I think the above comment is wrong.
It's wrong.
Fixed.
8.
+ /*
+ * We cant't allow upgrading with link mode between 9.5 or before and 9.6 or later,
+ * because the format of visibility map has been changed on version 9.6.
+ */
a. /cant't/can't
b. changed on version 9.6/changed in version 9.6
c. Won't such a change need to be documented in the pg_upgrade
documentation (Notes Section)?
Fixed, and updated the documentation.
9.
@@ -180,6 +181,13 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
 new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
 vm_crashsafe_match = false;
+
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_rewrite_needed = true;
..
@@ -276,7 +285,15 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
 {
 pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (strcmp(type_suffix, "_vm") == 0 &&
+ old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ rewrite_vm = true;
Instead of doing the re-check in transfer_relfile(), I think it is better
to pass an additional parameter to this function.
I agree.
Fixed.
10.
You have mentioned up-thread that you have changed the patch so that
PageClearAllVisible clears both bits; can you please point me to this
change?
Basically, after applying the patch, I see the below code in bufpage.h:
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
Don't we need to clear the PD_ALL_FROZEN separately?
The previous patch is wrong. PageClearAllVisible() should be:
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
The all-frozen flag/bit is cleared only by modifying the page, so it is
impossible for only the all-frozen flag/bit to be cleared.
Clearing the all-visible flag/bit also means that the page has some
garbage and needs to be vacuumed.
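For reference, the page-level macros would then fit together roughly like
this (only PageClearAllVisible is quoted from the patch; the other two
macros are reconstructed here on the assumption that bufpage.h gains a
PD_ALL_FROZEN flag with PageIsAllFrozen/PageSetAllFrozen, as used elsewhere
in the diff):

#define PageIsAllFrozen(page) \
	(((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
#define PageSetAllFrozen(page) \
	(((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
/* Clearing visibility always clears both bits; PD_ALL_FROZEN is never cleared alone. */
#define PageClearAllVisible(page) \
	(((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))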
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v20.patchtext/x-patch; charset=US-ASCII; name=000_add_frozen_bit_into_visibilitymap_v20.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5549de7..bb63bb9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5900,7 +5900,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs aggressive freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -5944,7 +5944,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs aggressive freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index b5d4050..9183aba 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and which pages contain only tuples that are marked as
+ frozen. This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -438,23 +438,22 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows.
+ To ensure all old row versions have been frozen, a scan of all pages that
+ are not marked as frozen is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: a table sweep is forced if
+ it has not been ensured that all row versions in the table are frozen for
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transactions.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
+ the time <command>VACUUM</> last scanned pages that are not marked as frozen.
+ If it were to go unvacuumed for longer than
that, data loss could result. To ensure that this does not happen,
autovacuum is invoked on any table that might contain unfrozen rows with
XIDs older than the age specified by the configuration parameter <xref
@@ -490,8 +489,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +525,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +553,19 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. The whole table is scanned only when all pages happen to require
+ vacuuming to remove dead row versions. In other cases such as where
+ <structfield>relfrozenxid</> is more than <varname>vacuum_freeze_table_age</>
+ transactions old, or where <command>VACUUM</>'s <literal>FREEZE</> option is used,
+ <command>VACUUM</> can skip the pages on which all tuples are already
+ marked as frozen.
+ When all pages of table are eventually marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transactions started since the <command>VACUUM</> started).
+ If <structfield>relfrozenxid</> has not been advanced by the time
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,28 +642,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ When <command>VACUUM</> scans all pages that are not marked as frozen,
+ regardless of what causes it to do so, the value for that table can be advanced.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. A
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +743,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e64b7ef..1908a4d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1332,6 +1332,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Estimated number of rows modified since this table was last analyzed</entry>
</row>
<row>
+ <entry><structfield>n_frozen_page</></entry>
+ <entry><type>integer</></entry>
+ <entry>Number of frozen pages</entry>
+ </row>
+ <row>
<entry><structfield>last_vacuum</></entry>
<entry><type>timestamp with time zone</></entry>
<entry>Last time at which this table was manually vacuumed
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index eb113c2..854a900 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -657,6 +657,12 @@ psql --username postgres --file script.sql postgres
</para>
<para>
+ Since the format of visibility map has been changed in version 9.6,
+ <application>pg_upgrade</> does not support upgrading of databases
+ from 9.5 or before to 9.6 or later with link mode (-k).
+ </para>
+
+ <para>
All failure, rebuild, and reindex cases will be reported by
<application>pg_upgrade</> if they affect your installation;
post-upgrade scripts to rebuild tables and indexes will be
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..9328cdf 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -623,18 +623,21 @@ can be used to examine the information stored in free space maps.
<para>
Each heap relation has a Visibility Map
(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
+visible to all active transactions, and pages contain only tuples that are
+marked as frozen.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_vm</> suffix.
+For example, if the filenode of a relation is 12345, the VM is stored in a file
+called <filename>12345_vm</>, in the same directory as the main relation file.
Note that indexes do not have VMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The visibility map stores two bits per heap page: all-visible and all-frozen.
+A set all-visible bit means that all tuples on the page are known to be visible
+to all transactions. A set all-frozen bit means that all tuples on the page are
+completely marked as frozen. This means that the page does not contain any tuples
+that need to be vacuumed, even if a whole-table scan is required.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 66deb1f..9cd58be 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2400,8 +2400,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -3000,9 +3000,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -7166,7 +7166,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -7176,6 +7176,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7765,7 +7766,12 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7817,7 +7823,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..56ab497 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,33 +21,40 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if whole table scanning vacuum is required (e.g. anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. The all-frozen bit must be
+ * cleared together with the all-visible bit.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is all-visible or
+ * all-frozen; we just don't know that for certain. The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility map
+ * bit. If a crash occurs after the visibility map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
+ * The visibility map has the all-frozen bit which indicates all tuples on
+ * corresponding page has been completely frozen, so the visibility map is also
+ * used for anti-wraparound vacuum, even if freezing of tuples is required.
*
* LOCKING
*
@@ -58,14 +65,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +108,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, other for all-frozen.
+*/
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -115,24 +125,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set bits for visible and freeze */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +169,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,11 +181,12 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) <<
+ (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s block %d", RelationGetRelationName(rel), heapBlk);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -225,7 +254,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +263,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and flags which indicates what flag we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +275,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,11 +285,13 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
Assert(InRecovery || BufferIsValid(heapBuf));
+ Assert((flags & VISIBILITYMAP_ALL_VISIBLE) ||
+ (flags & (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN)));
/* Check that we have the right heap page pinned, if present */
if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -272,11 +305,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +318,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +328,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(((flags & VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
+ ((flags & VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +348,17 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test if bit(s) is set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible to all or are marked frozen, according
+ * to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must pass flags
+ * for which it needs to check the value in visibility map.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -328,7 +368,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +377,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +400,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single or double bit read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ? true : false;
return result;
}
@@ -374,14 +414,16 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * Both the all-visible and all-frozen counts are returned to the caller.
*/
-BlockNumber
-visibilitymap_count(Relation rel)
+void
+visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)
{
- BlockNumber result = 0;
BlockNumber mapBlock;
+ *all_visible = *all_frozen = 0;
+
for (mapBlock = 0;; mapBlock++)
{
Buffer mapBuffer;
@@ -406,13 +448,12 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ *all_visible += number_of_ones_for_visible[map[i]];
+ *all_frozen += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
}
-
- return result;
}
/*
@@ -435,7 +476,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ elog(DEBUG1, "vm_truncate %s block %d", RelationGetRelationName(rel), nheapblocks);
#endif
RelationOpenSmgr(rel);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..5f1733e 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1813,7 +1813,7 @@ FormIndexDatum(IndexInfo *indexInfo,
* isprimary: if true, set relhaspkey true; else no change
* reltuples: if >= 0, set reltuples to this value; else no change
*
- * If reltuples >= 0, relpages and relallvisible are also updated (using
+ * If reltuples >= 0, relpages, relallvisible are also updated (using
* RelationGetNumberOfBlocks() and visibilitymap_count()).
*
* NOTE: an important side-effect of this operation is that an SI invalidation
@@ -1919,9 +1919,10 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen; /* not used */
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ visibilitymap_count(rel, &relallvisible, &relallfrozen);
else /* don't bother for indexes */
relallvisible = 0;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ccc030f..8c555eb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -444,6 +444,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(C.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ddb68ab..bdbd7db 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -314,6 +314,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ BlockNumber relallvisible,
+ relallfrozen;
if (inh)
ereport(elevel,
@@ -564,6 +566,10 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
}
}
+ /* Calculate the number of all-visible and all-frozen bit */
+ if (!inh)
+ visibilitymap_count(onerel, &relallvisible, &relallfrozen);
+
/*
* Update pages/tuples stats in pg_class ... but not if we're doing
* inherited stats.
@@ -572,7 +578,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ relallvisible,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -608,7 +614,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
* tracks per-table stats.
*/
if (!inh)
- pgstat_report_analyze(onerel, totalrows, totaldeadrows);
+ pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
if (!(options & VACOPT_VACUUM))
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7c4ef58..0a02a25 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -729,11 +729,11 @@ vac_estimate_reltuples(Relation relation, bool is_analyze,
* marked with xmin = our xid.
*
* In addition to fundamentally nontransactional statistics such as
- * relpages and relallvisible, we try to maintain certain lazily-updated
- * DDL flags such as relhasindex, by clearing them if no longer correct.
- * It's safe to do this in VACUUM, which can't run in parallel with
- * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
- * However, it's *not* safe to do it in an ANALYZE that's within an
+ * relpages, relallvisible, we try to maintain certain lazily-updated
+ * DDL flags such as relhasindex, by clearing them if no longer correct.
+ * It's safe to do this in VACUUM, which can't run in
+ * parallel with CREATE INDEX/RULE/TRIGGER and can't be part of a transaction
+ * block. However, it's *not* safe to do it in an ANALYZE that's within an
* outer transaction, because for example the current transaction might
* have dropped the last index; then we'd think relhasindex should be
* cleared, but if the transaction later rolls back this would be wrong.
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2429889..0fa31a3 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped by all-frozen bit of visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even if scan_all is set, we can still skip scanning some pages
+ * according to the all-frozen bit of the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +258,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -287,8 +293,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* tuple density") unless there's some actual evidence for the latter.
*
* We do update relallvisible even in the corner case, since if the table
- * is all-visible we'd definitely like to know that. But clamp the value
- * to be not more than what we're setting relpages to.
+ * is all-visible we'd definitely like to know that.
+ * But clamp the value to be not more than what we're setting relpages to.
*
* Also, don't change relfrozenxid/relminmxid if we skipped any pages,
* since then we don't know for certain that all tuples have a newer xmin.
@@ -301,10 +307,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ visibilitymap_count(onerel, &new_rel_allvisible, &new_rel_allfrozen);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -325,7 +334,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),
onerel->rd_rel->relisshared,
new_live_tuples,
- vacrelstats->new_dead_tuples);
+ vacrelstats->new_dead_tuples,
+ new_rel_allfrozen);
/* and log the action if appropriate */
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
@@ -360,10 +370,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +497,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * visibility map means that we can't update relfrozenxid, so we only want
+ * to do it if we can skip a goodly number. On the other hand, we count
+ * both how many pages we skipped according to all-frozen bit of visibility
+ * map and how many pages we freeze, so we can update relfrozenxid if
+ * the sum of two is as many as pages of table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -498,24 +512,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we might not actually skip any pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
* Note: The value returned by visibilitymap_test could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*/
for (next_not_all_visible_block = 0;
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE))
break;
vacuum_delay_point();
}
@@ -533,9 +548,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples we froze on this page */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* # of frozen tuples in a single page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -548,7 +567,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE))
break;
vacuum_delay_point();
}
@@ -563,13 +583,30 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
skipping_all_visible_blocks = false;
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to visibility map.
+ * We check whether this block is all-frozen or not, to skip to
+ * vacuum this page even if scan_all is true.
+ */
+ bool all_frozen = visibilitymap_test(onerel, blkno, &vmbuffer,
+ VISIBILITYMAP_ALL_FROZEN);
+ if (scan_all)
+ {
+ if (all_frozen)
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ }
+ else if (skipping_all_visible_blocks)
+ continue;
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -740,7 +777,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +802,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +958,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +976,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples then we mark the buffer dirty, and write a WAL
+ * record recording the changes. We must log the changes to be crash-safe
+ * against future truncation of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1011,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1036,46 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,7 +1086,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
@@ -1078,7 +1146,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1114,6 +1182,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to visibility map",
+ "skipped %d frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages,
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1230,6 +1305,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1281,19 +1357,34 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set the page-level flag and the VM all-frozen bit.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
+ uint8 flags = 0;
+
+ if (!visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* Also set the all-frozen bit in the flags, if needed */
+ if (all_frozen &&
+ !visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+
+ if (flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1783,10 +1874,12 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and all_frozen which indicates whether all
+ * tuples on this page are frozen.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1795,6 +1888,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1818,11 +1912,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1859,6 +1954,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1867,6 +1966,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
@@ -1875,5 +1975,8 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
} /* scan along page */
+ if (!all_visible)
+ *all_frozen = false;
+
return all_visible;
}
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ab018c4..05a17e1 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1329,7 +1329,8 @@ pgstat_report_autovac(Oid dboid)
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgVacuum msg;
@@ -1343,6 +1344,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -1354,7 +1356,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
*/
void
pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgAnalyze msg;
@@ -1394,6 +1397,7 @@ pgstat_report_analyze(Relation rel,
msg.m_analyzetime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -5069,6 +5073,7 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
if (msg->m_autovacuum)
{
@@ -5103,6 +5108,7 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
/*
* We reset changes_since_analyze to zero, forgetting any changes that
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f7c9bf6..98c14f9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -46,6 +46,7 @@ extern Datum pg_stat_get_vacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autovacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_analyze_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_frozen_pages(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_calls(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_total_time(PG_FUNCTION_ARGS);
@@ -450,6 +451,21 @@ pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_frozen_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int32 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int32) (tabentry->n_frozen_pages);
+
+ PG_RETURN_INT32(result);
+}
+
+Datum
pg_stat_get_function_calls(PG_FUNCTION_ARGS)
{
Oid funcid = PG_GETARG_OID(0);
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 41d4606..3a666f8 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -231,6 +231,15 @@ check_cluster_versions(void)
if (old_cluster.major_version > new_cluster.major_version)
pg_fatal("This utility cannot be used to downgrade to older major PostgreSQL versions.\n");
+ /*
+ * We can't allow upgrading with link mode from 9.5 or before to 9.6 or later,
+ * because the format of visibility map has been changed in version 9.6.
+ */
+ if (user_opts.transfer_mode == TRANSFER_MODE_LINK &&
+ GET_MAJOR_VERSION(old_cluster.major_version) < 906 &&
+ GET_MAJOR_VERSION(new_cluster.major_version) >= 906)
+ pg_fatal("This utility cannot upgrade from PostgreSQL 9.5 or before to 9.6 or later with link mode.\n");
+
/* get old and new binary versions */
get_bin_version(&old_cluster);
get_bin_version(&new_cluster);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 79d9390..d47a98b 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
#include <fcntl.h>
@@ -21,6 +22,43 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static const char *rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool force);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
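+/*
+ * Each table entry spreads the 8 one-bit all-visible flags of an old-format
+ * byte into the even bit positions of a 16-bit word, leaving the odd
+ * (all-frozen) bit positions zero; for example, old byte 0x03 (heap blocks 0
+ * and 1 all-visible) maps to 0x0005.
+ */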
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
+
+/*
+ * copyOrRewriteFile()
+ * This function copies a file, or rewrites a visibility map file.
+ * If rewrite_vm is true, we rewrite the visibility map regardless of the value of pageConverter.
+ */
+const char *
+copyOrRewriteFile(pageCnvCtx *pageConverter,
+ const char *src, const char *dst, bool force, bool rewrite_vm)
+{
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, force);
+ else
+ return copyAndUpdateFile(pageConverter, src, dst, force);
+}
/*
* copyAndUpdateFile()
@@ -201,6 +239,97 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilitymap()
+ *
+ * An additional bit, which indicates that all tuples on the page are completely
+ * frozen, was added to the visibility map in PG 9.6, so the format of the
+ * visibility map has changed.
+ * Copies a visibility map file while adding an all-frozen bit (initially 0)
+ * after each existing bit.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+#define REWRITE_BUF_SIZE (50 * BLCKSZ)
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd = 0;
+ int dst_fd = 0;
+ uint16 vm_bits;
+ ssize_t nbytes;
+ char *buffer = NULL;
+ int ret = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return getErrorText(EINVAL);
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
+
+ /* Copy page header data in advance */
+ if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
+ goto err;
+
+ if (write(dst_fd, buffer, nbytes) != nbytes)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ /* perform data rewriting, i.e. read from source, rewrite, and write to destination */
+ while (true)
+ {
+ ssize_t nbytes = read(src_fd, buffer, REWRITE_BUF_SIZE);
+ char *cur, *end;
+
+ if (nbytes < 0)
+ {
+ ret = -1;
+ break;
+ }
+
+ if (nbytes == 0)
+ break;
+
+ cur = buffer;
+ end = buffer + nbytes;
+
+ /* Rewrite each source byte into BITS_PER_HEAPBLOCK bytes and write them to dst_fd */
+ while (end > cur)
+ {
+ /* Look up the rewritten bits for this byte in the table */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+
+ if (write(dst_fd, &vm_bits, BITS_PER_HEAPBLOCK) != BITS_PER_HEAPBLOCK)
+ {
+ ret = -1;
+ break;
+ }
+ cur++;
+ }
+ }
+
+err:
+
+ if (buffer)
+ pg_free(buffer);
+
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText(errno);
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 13aa891..d04d836 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -112,6 +112,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of visibility map changed with this 9.6 commit,
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201510191
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -396,6 +400,8 @@ const pageCnvCtx *setupPageConverter(void);
typedef void *pageCnvCtx;
#endif
+const char *copyOrRewriteFile(pageCnvCtx *pageConverter, const char *src,
+ const char *dst, bool force, bool rewrite_vm);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c22df42..5d07fff 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *suffix, bool vm_need_rewrite);
/*
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_need_rewrite = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,13 +181,20 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_need_rewrite = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(pageConverter, &maps[mapnum], "", false);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +202,9 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
- if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ transfer_relfile(pageConverter, &maps[mapnum], "_fsm", false);
+ if (vm_crashsafe_match || vm_need_rewrite)
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", vm_need_rewrite);
}
}
}
@@ -210,7 +218,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
*/
static void
transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+ const char *type_suffix, bool vm_need_rewrite)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -218,6 +226,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
int fd;
int segno;
char extent_suffix[65];
+ bool rewrite_vm = false;
/*
* Now copy/link any related segments as well. Remember, PG breaks large
@@ -276,7 +285,13 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (strcmp(type_suffix, "_vm") == 0 && vm_need_rewrite)
+ rewrite_vm = true;
+
+ if ((msg = copyOrRewriteFile(pageConverter, old_file, new_file, true, rewrite_vm)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index aa7f399..6b058d4 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..93afb10 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..bacc349 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,22 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern void visibilitymap_count(Relation rel, BlockNumber *all_visible,
+ BlockNumber *all_frozen);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 15e0b97..f2ef868 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201510161
+#define CATALOG_VERSION_NO 201510191
#endif
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f688454..de9f11a 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2777,6 +2777,8 @@ DATA(insert OID = 3056 ( pg_stat_get_analyze_count PGNSP PGUID 12 1 0 0 0 f f f
DESCR("statistics: number of manual analyzes for a table");
DATA(insert OID = 3057 ( pg_stat_get_autoanalyze_count PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_autoanalyze_count _null_ _null_ _null_ ));
DESCR("statistics: number of auto analyzes for a table");
+DATA(insert OID = 6015 ( pg_stat_get_frozen_pages PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_frozen_pages _null_ _null_ _null_ ));
+DESCR("statistics: number of frozen pages of table");
DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 0 f f f f t t s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_idset _null_ _null_ _null_ ));
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f f f t s r 1 0 2249 "23" "{23,26,23,26,25,25,25,16,1184,1184,1184,1184,869,25,23,28,28,16,25,25,23,16,25}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,pid,usesysid,application_name,state,query,waiting,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,ssl,sslversion,sslcipher,sslbits,sslcompression,sslclientdn}" _null_ _null_ pg_stat_get_activity _null_ _null_ _null_ ));
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9ecc163..3bca9f5 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -355,6 +355,7 @@ typedef struct PgStat_MsgVacuum
TimestampTz m_vacuumtime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgVacuum;
@@ -372,6 +373,7 @@ typedef struct PgStat_MsgAnalyze
TimestampTz m_analyzetime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgAnalyze;
@@ -615,6 +617,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter n_dead_tuples;
PgStat_Counter changes_since_analyze;
+ int32 n_frozen_pages;
+
PgStat_Counter blocks_fetched;
PgStat_Counter blocks_hit;
@@ -917,9 +921,11 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type t
extern void pgstat_report_autovac(Oid dboid);
extern void pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_recovery_conflict(int reason);
extern void pgstat_report_deadlock(void);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..102aa81 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -367,7 +369,12 @@ typedef PageHeaderData *PageHeader;
#define PageSetAllVisible(page) \
(((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
#define PageClearAllVisible(page) \
- (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+ (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
+
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
#define PageIsPrunable(page, oldestxmin) \
( \
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 80374e4..b259e65 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1683,6 +1683,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(c.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid) AS last_autovacuum,
pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
@@ -1795,6 +1796,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
@@ -1838,6 +1840,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index f5be70f..dd49786 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -127,6 +127,8 @@ SELECT count(*) FROM tenk2 WHERE unique1 = 1;
1
(1 row)
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
SELECT pg_sleep(1.0);
@@ -175,6 +177,14 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
t | t
(1 row)
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+ ?column?
+----------
+ t
+(1 row)
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
snapshot_newer
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..1c74a59
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,22 @@
+--
+-- Visibility map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 24 pages
+INFO: skipped 40 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 40 pages
+\set VERBOSITY default
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index c63abf4..7e905df 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -110,3 +110,6 @@ test: event_trigger
# run stats by itself because its delay may be insufficient under heavy load
test: stats
+
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 88dcd64..2ecfe56 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -159,3 +159,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index cd2d592..b3c640f 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -120,6 +120,8 @@ ROLLBACK;
SELECT count(*) FROM tenk2;
-- do an indexscan
SELECT count(*) FROM tenk2 WHERE unique1 = 1;
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
@@ -145,6 +147,10 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
WHERE st.relname='tenk2' AND cl.relname='tenk2';
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..832120b
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,16 @@
+--
+-- Visibility map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+\set VERBOSITY default
+
+DROP TABLE vmtest;
On Fri, Oct 30, 2015 at 1:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Oct 28, 2015 at 12:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Oct 24, 2015 at 2:24 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Oct 24, 2015 at 10:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think we can display information about relallfrozen in pg_stat_*_tables
as suggested by you. It doesn't make much sense to keep it in pg_class
unless we have some use case for the same.

I'm thinking a bit about implementing the read-only table that is
restricted to update/delete and is ensured that the whole table is frozen,
if this feature is committed.
The value of relallfrozen might be useful for such a feature.

Thank you for reviewing!

If we need this for the read-only table feature, then better let's add that
after discussing the design of that feature. It doesn't seem advisable
to have an extra field in a system table which we might need in a feature
that has not yet been completely discussed.

I changed it so that the number of frozen pages is stored in
pg_stat_all_tables as statistics information.
Also, the tests related to counting the all-visible bit and skipping
vacuum are added to the visibility map test, and the test related to
counting all-frozen is added to the stats collector test.

Attached updated v20 patch.
Review Comments:
-------------------------------
1.
/*
- * Find buffer to insert this tuple into.  If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into.  If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map and
+ * frozen map page.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,

I think it is sufficient to say 'visibility map page' at the end.
Let's not include 'frozen map page'.

Fixed.

2.
+ * corresponding page has been completely frozen, so the visibility map is also
+ * used for anti-wraparound vacuum, even if freezing tuples is required.

/all tuple/all tuples
/freezing tuples/freezing of tuples

Fixed.

3.
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible or frozen to all, according to the visibility map?

I think it is better to modify the above statement as:
Are all tuples on heapBlk visible to all or are marked as frozen, according
to the visibility map?

Fixed.

4.
+ * releasing *buf after it's done testing and setting bits, and must set flags
+ * which indicates what flag we want to test.

Here, are you talking about the flags passed to visibilitymap_set()? If
yes, then the above comment is not clear; how about: and must pass flags
for which it needs to check the value in visibility map.

Fixed.

5.
+ * both how many pages we skipped according to all-frozen bit of visibility
+ * map and how many pages we freeze page, so we can update relfrozenxid if

In the above sentence the word 'page' after freeze sounds redundant.
/we freeze page/we freeze
Another suggestion:
/sum of them/sum of two

Fixed.

6.
+ * This block is at least all-visible according to visibility map.
+ * We check whehter this block is all-frozen or not, to skip to

whether is mis-spelled.

Fixed.

7.
+ * If we froze any tuples or any tuples are already frozen,
+ * mark the buffer dirty, and write a WAL record recording the changes.

Here, I think the WAL record is written only when we mark some
tuple/'s as frozen, not if they are already frozen,
so in that regard, I think the above comment is wrong.

It's wrong.
Fixed.

8.
+ /*
+ * We cant't allow upgrading with link mode between 9.5 or before and 9.6 or later,
+ * because the format of visibility map has been changed on version 9.6.
+ */

a. /cant't/can't
b. changed on version 9.6/changed in version 9.6
c. Won't such a change need to be updated in the pg_upgrade
documentation (Notes section)?

Fixed.
And updated the documentation.

9.
@@ -180,6 +181,13 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_rewrite_needed = true;
..
@@ -276,7 +285,15 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (strcmp (type_suffix, "_vm") == 0 &&
+ old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ rewrite_vm = true;

Instead of doing the re-check in transfer_relfile(), I think it is better
to pass an additional parameter to this function.

I agree.
Fixed.

10.
You have mentioned up-thread that you have changed the patch so that
PageClearAllVisible clears both bits; can you please point me to this
change?
Basically, after applying the patch, I see the below code in bufpage.h:
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
Don't we need to clear PD_ALL_FROZEN separately?

The previous patch was wrong. PageClearAllVisible() should be:
#define PageClearAllVisible(page) \
(((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
The all-frozen flag/bit is cleared only by modifying the page, so it is
impossible for only the all-frozen flag/bit to be cleared.
The clearing of the all-visible flag/bit also means that the page has some
garbage and needs to be vacuumed.

The v20 patch had a bug in the result of the regression test.
Attached updated v21 patch.
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v21.patchtext/x-patch; charset=US-ASCII; name=000_add_frozen_bit_into_visibilitymap_v21.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5549de7..bb63bb9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5900,7 +5900,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs aggressive freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -5944,7 +5944,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs aggressive freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index b5d4050..9183aba 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and of which pages contain only tuples that are
+ marked as frozen. This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -438,23 +438,22 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows.
+ To ensure all old row versions have been frozen, a scan of all pages that
+ are not marked as frozen is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: a table sweep is forced if
+ the table has not had all of its row versions frozen for
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transactions.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
+ the time <command>VACUUM</> last scanned pages that are not marked as frozen.
+ If it were to go unvacuumed for longer than
that, data loss could result. To ensure that this does not happen,
autovacuum is invoked on any table that might contain unfrozen rows with
XIDs older than the age specified by the configuration parameter <xref
@@ -490,8 +489,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +525,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +553,19 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. The whole table is scanned only when all pages happen to require
+ vacuuming to remove dead row versions. In other cases, such as when
+ <structfield>relfrozenxid</> is more than <varname>vacuum_freeze_table_age</>
+ transactions old or when <command>VACUUM</>'s <literal>FREEZE</> option is used,
+ <command>VACUUM</> can skip the pages on which all tuples are
+ marked as frozen.
+ When all pages of the table are eventually marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transactions started since the <command>VACUUM</> started).
+ If <structfield>relfrozenxid</> has not been advanced by the time
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,28 +642,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ When <command>VACUUM</> scans all pages that are not marked as frozen,
+ regardless of what causes it, this enables advancing the value for that table.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. A
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +743,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e64b7ef..1908a4d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1332,6 +1332,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Estimated number of rows modified since this table was last analyzed</entry>
</row>
<row>
+ <entry><structfield>n_frozen_page</></entry>
+ <entry><type>integer</></entry>
+ <entry>Number of frozen pages</entry>
+ </row>
+ <row>
<entry><structfield>last_vacuum</></entry>
<entry><type>timestamp with time zone</></entry>
<entry>Last time at which this table was manually vacuumed
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index eb113c2..854a900 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -657,6 +657,12 @@ psql --username postgres --file script.sql postgres
</para>
<para>
+ Since the format of the visibility map has been changed in version 9.6,
+ <application>pg_upgrade</> does not support upgrading databases
+ from 9.5 or before to 9.6 or later with link mode (-k).
+ </para>
+
+ <para>
All failure, rebuild, and reindex cases will be reported by
<application>pg_upgrade</> if they affect your installation;
post-upgrade scripts to rebuild tables and indexes will be
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..9328cdf 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -623,18 +623,21 @@ can be used to examine the information stored in free space maps.
<para>
Each heap relation has a Visibility Map
(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
+visible to all active transactions, and which pages contain only tuples that
+are marked as frozen.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_vm</> suffix.
+For example, if the filenode of a relation is 12345, the VM is stored in a file
+called <filename>12345_vm</>, in the same directory as the main relation file.
Note that indexes do not have VMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The visibility map stores two bits per heap page: all-visible and all-frozen.
+A set all-visible bit means that all tuples on the page are known to be visible
+to all transactions. A set all-frozen bit means that all tuples on the page are
+completely marked as frozen. This means that the page does not contain any tuples
+that need to be vacuumed, even if a scan of the whole table is required.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 66deb1f..9cd58be 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2400,8 +2400,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
/*
- * Find buffer to insert this tuple into. If the page is all visible,
- * this will also pin the requisite visibility map page.
+ * Find buffer to insert this tuple into. If the page is all visible
+ * or all frozen, this will also pin the requisite visibility map.
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
@@ -3000,9 +3000,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -7166,7 +7166,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -7176,6 +7176,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7765,7 +7766,12 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7817,7 +7823,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..56ab497 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -21,33 +21,40 @@
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if a whole-table-scanning vacuum is required (e.g. anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. The all-frozen bit must be
+ * cleared together with the all-visible bit.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is all-visible or
+ * all-frozen; we just don't know that for certain. The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility map
+ * bit. If a crash occurs after the visibility map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
+ * The visibility map has the all-frozen bit, which indicates that all tuples on
+ * the corresponding page have been completely frozen, so the visibility map is also
+ * used for anti-wraparound vacuum, even if freezing of tuples is required.
*
* LOCKING
*
@@ -58,14 +65,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +108,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One bit for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
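As a quick aside (not part of the patch): with two bits per heap block, a heap block still maps to one map byte, but its flags now live in a 2-bit slot inside that byte. The following standalone sketch uses the same constants, plus the VISIBILITYMAP_ALL_VISIBLE/ALL_FROZEN values introduced later in the patch, to show the arithmetic.

#include <stdio.h>
#include <stdint.h>

#define BITS_PER_HEAPBLOCK   2
#define HEAPBLOCKS_PER_BYTE  4
#define ALL_VISIBLE          0x01    /* same value as VISIBILITYMAP_ALL_VISIBLE */
#define ALL_FROZEN           0x02    /* same value as VISIBILITYMAP_ALL_FROZEN */

int
main(void)
{
    unsigned heapBlk = 10;                              /* arbitrary example block */
    int      mapBit  = heapBlk % HEAPBLOCKS_PER_BYTE;   /* slot 2 within the byte */
    uint8_t  visible = ALL_VISIBLE << (BITS_PER_HEAPBLOCK * mapBit);
    uint8_t  frozen  = ALL_FROZEN  << (BITS_PER_HEAPBLOCK * mapBit);

    printf("block %u: byte %u, visible mask 0x%02x, frozen mask 0x%02x\n",
           heapBlk, heapBlk / HEAPBLOCKS_PER_BYTE, visible, frozen);
    return 0;
}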
@@ -115,24 +125,42 @@
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set all-visible and all-frozen bits */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
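For reviewers who want to double-check the two lookup tables above, here is the small standalone program I used to regenerate them (my own verification aid, not part of the patch). It counts the even-position (all-visible) and odd-position (all-frozen) bits of each possible map byte.

#include <stdio.h>

int
main(void)
{
    int b, i;

    for (b = 0; b < 256; b++)
    {
        int visible = 0;
        int frozen = 0;

        for (i = 0; i < 4; i++)         /* 4 heap blocks per map byte */
        {
            if (b & (0x01 << (2 * i)))  /* all-visible bit of slot i */
                visible++;
            if (b & (0x02 << (2 * i)))  /* all-frozen bit of slot i */
                frozen++;
        }
        printf("%d: visible=%d frozen=%d\n", b, visible, frozen);
    }
    return 0;
}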
@@ -141,7 +169,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,11 +181,12 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN) <<
+ (BITS_PER_HEAPBLOCK * mapBit);
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s block %d", RelationGetRelationName(rel), heapBlk);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -225,7 +254,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +263,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and the flags indicating which bits to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +275,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,11 +285,13 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
Assert(InRecovery || BufferIsValid(heapBuf));
+ Assert((flags & VISIBILITYMAP_ALL_VISIBLE) ||
+ (flags & (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN)));
/* Check that we have the right heap page pinned, if present */
if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -272,11 +305,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != ((map[mapByte] >> (BITS_PER_HEAPBLOCK * mapBit)) & flags))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << (BITS_PER_HEAPBLOCK * mapBit));
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +318,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +328,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(((flags & VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
+ ((flags & VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
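To sanity-check the new set path, here is a standalone toy example (not from the patch) of the intended behaviour: OR the requested flags into the block's 2-bit slot, skipping the write when every requested bit is already set.

#include <stdio.h>
#include <stdint.h>

#define BITS_PER_HEAPBLOCK 2
#define ALL_VISIBLE 0x01
#define ALL_FROZEN  0x02

int
main(void)
{
    uint8_t mapbyte = 0x10;             /* slot 2 is already all-visible */
    int     mapBit  = 2;                /* block's slot within the byte */
    uint8_t flags   = ALL_VISIBLE | ALL_FROZEN;
    uint8_t shifted = flags << (BITS_PER_HEAPBLOCK * mapBit);

    if ((mapbyte & shifted) != shifted) /* some requested bit still clear */
        mapbyte |= shifted;

    printf("map byte is now 0x%02x\n", mapbyte);    /* prints 0x30 */
    return 0;
}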
@@ -310,15 +348,17 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_test - test whether the given bit(s) are set
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible to all transactions, or all frozen,
+ * according to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
* earlier call to visibilitymap_pin or visibilitymap_test on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits. The caller must
+ * pass the flags it wants to test in the visibility map.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -328,7 +368,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* all concurrency issues!
*/
bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf, uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -337,7 +377,7 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_test %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,11 +400,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A single byte read is atomic, so reading one or both bits is safe.
+ * There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
+ result = (map[mapByte] & (flags << (BITS_PER_HEAPBLOCK * mapBit))) ? true : false;
return result;
}
@@ -374,14 +414,16 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * Both the all-visible and all-frozen page counts are returned via the
+ * output parameters.
*/
-BlockNumber
-visibilitymap_count(Relation rel)
+void
+visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)
{
- BlockNumber result = 0;
BlockNumber mapBlock;
+ *all_visible = *all_frozen = 0;
+
for (mapBlock = 0;; mapBlock++)
{
Buffer mapBuffer;
@@ -406,13 +448,12 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ *all_visible += number_of_ones_for_visible[map[i]];
+ *all_frozen += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
}
-
- return result;
}
/*
@@ -435,7 +476,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ elog(DEBUG1, "vm_truncate %s block %d", RelationGetRelationName(rel), nheapblocks);
#endif
RelationOpenSmgr(rel);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..5f1733e 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1813,7 +1813,7 @@ FormIndexDatum(IndexInfo *indexInfo,
* isprimary: if true, set relhaspkey true; else no change
* reltuples: if >= 0, set reltuples to this value; else no change
*
- * If reltuples >= 0, relpages and relallvisible are also updated (using
+ * If reltuples >= 0, relpages and relallvisible are also updated (using
* RelationGetNumberOfBlocks() and visibilitymap_count()).
*
* NOTE: an important side-effect of this operation is that an SI invalidation
@@ -1919,9 +1919,10 @@ index_update_stats(Relation rel,
{
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
+ BlockNumber relallfrozen; /* not used */
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ visibilitymap_count(rel, &relallvisible, &relallfrozen);
else /* don't bother for indexes */
relallvisible = 0;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ccc030f..8c555eb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -444,6 +444,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(C.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ddb68ab..bdbd7db 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -314,6 +314,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ BlockNumber relallvisible,
+ relallfrozen;
if (inh)
ereport(elevel,
@@ -564,6 +566,10 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
}
}
+ /* Count the all-visible and all-frozen pages in the visibility map */
+ if (!inh)
+ visibilitymap_count(onerel, &relallvisible, &relallfrozen);
+
/*
* Update pages/tuples stats in pg_class ... but not if we're doing
* inherited stats.
@@ -572,7 +578,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ relallvisible,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -608,7 +614,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
* tracks per-table stats.
*/
if (!inh)
- pgstat_report_analyze(onerel, totalrows, totaldeadrows);
+ pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
if (!(options & VACOPT_VACUUM))
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7c4ef58..0a02a25 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -729,11 +729,11 @@ vac_estimate_reltuples(Relation relation, bool is_analyze,
* marked with xmin = our xid.
*
* In addition to fundamentally nontransactional statistics such as
- * relpages and relallvisible, we try to maintain certain lazily-updated
- * DDL flags such as relhasindex, by clearing them if no longer correct.
- * It's safe to do this in VACUUM, which can't run in parallel with
- * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
- * However, it's *not* safe to do it in an ANALYZE that's within an
+ * relpages and relallvisible, we try to maintain certain lazily-updated
+ * DDL flags such as relhasindex, by clearing them if no longer correct.
+ * It's safe to do this in VACUUM, which can't run in
+ * parallel with CREATE INDEX/RULE/TRIGGER and can't be part of a transaction
+ * block. However, it's *not* safe to do it in an ANALYZE that's within an
* outer transaction, because for example the current transaction might
* have dropped the last index; then we'd think relhasindex should be
* cleared, but if the transaction later rolls back this would be wrong.
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2429889..0fa31a3 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages skipped due to the
+ all-frozen bit of the visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -222,6 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
+ * Even when scan_all is set, we may still skip pages whose all-frozen bit
+ * is set in the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +258,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -287,8 +293,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* tuple density") unless there's some actual evidence for the latter.
*
* We do update relallvisible even in the corner case, since if the table
- * is all-visible we'd definitely like to know that. But clamp the value
- * to be not more than what we're setting relpages to.
+ * is all-visible we'd definitely like to know that.
+ * But clamp the value to be not more than what we're setting relpages to.
*
* Also, don't change relfrozenxid/relminmxid if we skipped any pages,
* since then we don't know for certain that all tuples have a newer xmin.
@@ -301,10 +307,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ visibilitymap_count(onerel, &new_rel_allvisible, &new_rel_allfrozen);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -325,7 +334,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),
onerel->rd_rel->relisshared,
new_live_tuples,
- vacrelstats->new_dead_tuples);
+ vacrelstats->new_dead_tuples,
+ new_rel_allfrozen);
/* and log the action if appropriate */
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
@@ -360,10 +370,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +497,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * the visibility map means that we can't update relfrozenxid, so we only
+ * want to do it if we can skip a goodly number of pages. On the other hand,
+ * we count both the pages we skip according to the all-frozen bit and the
+ * pages we freeze, so we can still update relfrozenxid when their sum
+ * covers the whole table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -498,24 +512,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we might not actually skip any pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
* Note: The value returned by visibilitymap_test could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*/
for (next_not_all_visible_block = 0;
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE))
break;
vacuum_delay_point();
}
@@ -533,9 +548,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples frozen by this vacuum */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -548,7 +567,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block++)
{
if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ &vmbuffer,
+ VISIBILITYMAP_ALL_VISIBLE))
break;
vacuum_delay_point();
}
@@ -563,13 +583,30 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
skipping_all_visible_blocks = false;
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We also check whether it is all-frozen, in which case we can skip
+ * vacuuming it even when scan_all is true.
+ */
+ bool all_frozen = visibilitymap_test(onerel, blkno, &vmbuffer,
+ VISIBILITYMAP_ALL_FROZEN);
+ if (scan_all)
+ {
+ if (all_frozen)
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ }
+ else if (skipping_all_visible_blocks)
+ continue;
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
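Restating the per-block decision above in isolation (my own summary, not code from the patch): a block that is not all-visible is always scanned; an aggressive (scan_all) vacuum may now skip a block only when its all-frozen bit is set; a normal vacuum keeps the existing skip-runs-of-all-visible behaviour.

/* Compact restatement of the skip decision; names are illustrative only. */
static int
should_skip_block(int scan_all, int all_visible_in_vm, int all_frozen_in_vm,
                  int skipping_all_visible_blocks)
{
    if (!all_visible_in_vm)
        return 0;                           /* must scan this block */
    if (scan_all)
        return all_frozen_in_vm;            /* aggressive vacuum: frozen pages only */
    return skipping_all_visible_blocks;     /* normal vacuum: skip long runs */
}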
@@ -740,7 +777,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -764,6 +802,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +958,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +976,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples then we mark the buffer dirty, and write a WAL
+ * record recording the changes. We must log the changes to be crash-safe
+ * against future truncation of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1011,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1036,46 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,7 +1086,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
@@ -1078,7 +1146,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map.
*/
if (BufferIsValid(vmbuffer))
{
@@ -1114,6 +1182,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to visibility map",
+ "skipped %d frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages,
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1230,6 +1305,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1281,19 +1357,34 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set the VM all-frozen bit as well.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
+ uint8 flags = 0;
+
+ if (!visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* Add the all-frozen bit to the flags, if needed */
+ if (all_frozen &&
+ !visibilitymap_test(onerel, blkno, vmbuffer, VISIBILITYMAP_ALL_FROZEN))
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+
+ if (flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1783,10 +1874,12 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and set *all_frozen if every tuple on
+ * this page is frozen.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1795,6 +1888,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1818,11 +1912,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible or frozen.
*/
if (ItemIdIsDead(itemid))
{
all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1859,6 +1954,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1867,6 +1966,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
@@ -1875,5 +1975,8 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
} /* scan along page */
+ if (!all_visible)
+ *all_frozen = false;
+
return all_visible;
}
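The invariant heap_page_is_all_visible() maintains for its new output can be stated compactly (again just an illustration, not patch code): *all_frozen may only remain true when the page is all-visible and every live tuple already has a frozen xmin.

/* Illustrative restatement of the all_frozen invariant; names are made up. */
static int
page_reports_all_frozen(int all_visible, int ntuples, const int xmin_frozen[])
{
    int i;

    if (!all_visible)
        return 0;
    for (i = 0; i < ntuples; i++)
        if (!xmin_frozen[i])
            return 0;
    return 1;
}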
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..08df289 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -116,7 +116,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
if (!visibilitymap_test(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ &node->ioss_VMBuffer, VISIBILITYMAP_ALL_VISIBLE))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ab018c4..05a17e1 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1329,7 +1329,8 @@ pgstat_report_autovac(Oid dboid)
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgVacuum msg;
@@ -1343,6 +1344,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -1354,7 +1356,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
*/
void
pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgAnalyze msg;
@@ -1394,6 +1397,7 @@ pgstat_report_analyze(Relation rel,
msg.m_analyzetime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -5069,6 +5073,7 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
if (msg->m_autovacuum)
{
@@ -5103,6 +5108,7 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
/*
* We reset changes_since_analyze to zero, forgetting any changes that
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f7c9bf6..98c14f9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -46,6 +46,7 @@ extern Datum pg_stat_get_vacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autovacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_analyze_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_frozen_pages(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_calls(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_total_time(PG_FUNCTION_ARGS);
@@ -450,6 +451,21 @@ pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_frozen_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int32 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int32) (tabentry->n_frozen_pages);
+
+ PG_RETURN_INT32(result);
+}
+
+Datum
pg_stat_get_function_calls(PG_FUNCTION_ARGS)
{
Oid funcid = PG_GETARG_OID(0);
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 41d4606..3a666f8 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -231,6 +231,15 @@ check_cluster_versions(void)
if (old_cluster.major_version > new_cluster.major_version)
pg_fatal("This utility cannot be used to downgrade to older major PostgreSQL versions.\n");
+ /*
+ * We can't allow upgrading in link mode from 9.5 or before to 9.6 or later,
+ * because the format of the visibility map changed in version 9.6.
+ */
+ if (user_opts.transfer_mode == TRANSFER_MODE_LINK &&
+ GET_MAJOR_VERSION(old_cluster.major_version) < 906 &&
+ GET_MAJOR_VERSION(new_cluster.major_version) >= 906)
+ pg_fatal("This utility cannot upgrade from PostgreSQL version from 9.5 or before to 9.6 or later with link mode.\n");
+
/* get old and new binary versions */
get_bin_version(&old_cluster);
get_bin_version(&new_cluster);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 79d9390..d47a98b 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
#include <fcntl.h>
@@ -21,6 +22,43 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static const char *rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool force);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
+
+/*
+ * copyOrRewriteFile()
+ * This function either copies the file or rewrites the visibility map file.
+ * If rewrite_vm is true, the visibility map is rewritten regardless of pageConverter.
+ */
+const char *
+copyOrRewriteFile(pageCnvCtx *pageConverter,
+ const char *src, const char *dst, bool force, bool rewrite_vm)
+{
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, force);
+ else
+ return copyAndUpdateFile(pageConverter, src, dst, force);
+}
/*
* copyAndUpdateFile()
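The rewrite_vm_table above is mechanical, so here is the small standalone generator I used to convince myself it is right (not part of the patch): each old map byte covers 8 heap blocks with one bit each, and expands into a 16-bit value whose two bytes cover 4 blocks each, with the new all-frozen bits left clear. Note that writing the uint16 directly, as rewriteVisibilitymap() below does, assumes a little-endian byte order.

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    int old, blk;

    for (old = 0; old < 256; old++)
    {
        uint16_t expanded = 0;

        for (blk = 0; blk < 8; blk++)
            if (old & (1 << blk))
                expanded |= (uint16_t) (1 << (2 * blk));    /* all-visible bit only */

        printf("%5u,%s", expanded, (old % 16 == 15) ? "\n" : " ");
    }
    return 0;
}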
@@ -201,6 +239,97 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilitymap()
+ *
+ * In PG 9.6 an additional bit, indicating that all tuples on the page are
+ * completely frozen, was added to the visibility map, so its format changed.
+ * This copies a visibility map file while inserting a cleared all-frozen bit
+ * after each existing all-visible bit.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+#define REWRITE_BUF_SIZE (50 * BLCKSZ)
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd = 0;
+ int dst_fd = 0;
+ uint16 vm_bits;
+ ssize_t nbytes;
+ char *buffer = NULL;
+ int ret = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return getErrorText(EINVAL);
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
+
+ /* Copy page header data in advance */
+ if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
+ goto err;
+
+ if (write(dst_fd, buffer, nbytes) != nbytes)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ /* perform the data rewriting, i.e. read from the source, write to the destination */
+ while (true)
+ {
+ ssize_t nbytes = read(src_fd, buffer, REWRITE_BUF_SIZE);
+ char *cur, *end;
+
+ if (nbytes < 0)
+ {
+ ret = -1;
+ break;
+ }
+
+ if (nbytes == 0)
+ break;
+
+ cur = buffer;
+ end = buffer + nbytes;
+
+ /* Expand each source byte into BITS_PER_HEAPBLOCK output bytes and write them to dst_fd */
+ while (end > cur)
+ {
+ /* Look up the expanded bit pattern in the rewrite table */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+
+ if (write(dst_fd, &vm_bits, BITS_PER_HEAPBLOCK) != BITS_PER_HEAPBLOCK)
+ {
+ ret = -1;
+ break;
+ }
+ cur++;
+ }
+ }
+
+err:
+
+ if (buffer)
+ pg_free(buffer);
+
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText(errno);
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 13aa891..d04d836 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -112,6 +112,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map changed with this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201510191
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -396,6 +400,8 @@ const pageCnvCtx *setupPageConverter(void);
typedef void *pageCnvCtx;
#endif
+const char *copyOrRewriteFile(pageCnvCtx *pageConverter, const char *src,
+ const char *dst, bool force, bool rewrite_vm);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c22df42..5d07fff 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *suffix, bool vm_need_rewrite);
/*
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_need_rewrite = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,13 +181,20 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_need_rewrite = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(pageConverter, &maps[mapnum], "", false);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +202,9 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
- if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ transfer_relfile(pageConverter, &maps[mapnum], "_fsm", false);
+ if (vm_crashsafe_match || vm_need_rewrite)
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", vm_need_rewrite);
}
}
}
@@ -210,7 +218,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
*/
static void
transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+ const char *type_suffix, bool vm_need_rewrite)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -218,6 +226,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
int fd;
int segno;
char extent_suffix[65];
+ bool rewrite_vm = false;
/*
* Now copy/link any related segments as well. Remember, PG breaks large
@@ -276,7 +285,13 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (strcmp(type_suffix, "_vm") == 0 && vm_need_rewrite)
+ rewrite_vm = true;
+
+ if ((msg = copyOrRewriteFile(pageConverter, old_file, new_file, true, rewrite_vm)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index aa7f399..6b058d4 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..93afb10 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..bacc349 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,22 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf,
+ uint8 flags);
+extern void visibilitymap_count(Relation rel, BlockNumber *all_visible,
+ BlockNumber *all_frozen);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
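With the revised header above, a caller-side usage sketch looks roughly like the following. This is only an illustration of the new signatures (it only compiles inside the backend), not code taken from the patch:

BlockNumber all_visible;
BlockNumber all_frozen;
Buffer      vmbuffer = InvalidBuffer;

/* both counts now come back through output parameters */
visibilitymap_count(rel, &all_visible, &all_frozen);

/* test a single block for the new all-frozen bit */
if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_FROZEN))
{
    /* every tuple on blkno is frozen; an aggressive vacuum may skip it */
}

if (BufferIsValid(vmbuffer))
    ReleaseBuffer(vmbuffer);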
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 15e0b97..f2ef868 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201510161
+#define CATALOG_VERSION_NO 201510191
#endif
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f688454..de9f11a 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2777,6 +2777,8 @@ DATA(insert OID = 3056 ( pg_stat_get_analyze_count PGNSP PGUID 12 1 0 0 0 f f f
DESCR("statistics: number of manual analyzes for a table");
DATA(insert OID = 3057 ( pg_stat_get_autoanalyze_count PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_autoanalyze_count _null_ _null_ _null_ ));
DESCR("statistics: number of auto analyzes for a table");
+DATA(insert OID = 6015 ( pg_stat_get_frozen_pages PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_frozen_pages _null_ _null_ _null_ ));
+DESCR("statistics: number of frozen pages of table");
DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 0 f f f f t t s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_idset _null_ _null_ _null_ ));
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f f f t s r 1 0 2249 "23" "{23,26,23,26,25,25,25,16,1184,1184,1184,1184,869,25,23,28,28,16,25,25,23,16,25}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,pid,usesysid,application_name,state,query,waiting,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,ssl,sslversion,sslcipher,sslbits,sslcompression,sslclientdn}" _null_ _null_ pg_stat_get_activity _null_ _null_ _null_ ));
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9ecc163..3bca9f5 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -355,6 +355,7 @@ typedef struct PgStat_MsgVacuum
TimestampTz m_vacuumtime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgVacuum;
@@ -372,6 +373,7 @@ typedef struct PgStat_MsgAnalyze
TimestampTz m_analyzetime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgAnalyze;
@@ -615,6 +617,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter n_dead_tuples;
PgStat_Counter changes_since_analyze;
+ int32 n_frozen_pages;
+
PgStat_Counter blocks_fetched;
PgStat_Counter blocks_hit;
@@ -917,9 +921,11 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type t
extern void pgstat_report_autovac(Oid dboid);
extern void pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_recovery_conflict(int reason);
extern void pgstat_report_deadlock(void);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..102aa81 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -367,7 +369,12 @@ typedef PageHeaderData *PageHeader;
#define PageSetAllVisible(page) \
(((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
#define PageClearAllVisible(page) \
- (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+ (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
+
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
#define PageIsPrunable(page, oldestxmin) \
( \
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 80374e4..b259e65 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1683,6 +1683,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(c.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid) AS last_autovacuum,
pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
@@ -1795,6 +1796,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
@@ -1838,6 +1840,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index f5be70f..dd49786 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -127,6 +127,8 @@ SELECT count(*) FROM tenk2 WHERE unique1 = 1;
1
(1 row)
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
SELECT pg_sleep(1.0);
@@ -175,6 +177,14 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
t | t
(1 row)
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+ ?column?
+----------
+ t
+(1 row)
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
snapshot_newer
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..aedce1e
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,22 @@
+--
+-- Visibility map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+-- All pages are become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 45 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+\set VERBOSITY default
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index c63abf4..7e905df 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -110,3 +110,6 @@ test: event_trigger
# run stats by itself because its delay may be insufficient under heavy load
test: stats
+
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 88dcd64..2ecfe56 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -159,3 +159,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index cd2d592..b3c640f 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -120,6 +120,8 @@ ROLLBACK;
SELECT count(*) FROM tenk2;
-- do an indexscan
SELECT count(*) FROM tenk2 WHERE unique1 = 1;
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
@@ -145,6 +147,10 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
WHERE st.relname='tenk2' AND cl.relname='tenk2';
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..832120b
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,16 @@
+--
+-- Visibility map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+-- All pages are become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+\set VERBOSITY default
+
+DROP TABLE vmtest;
On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote:
On 10/01/2015 07:43 AM, Robert Haas wrote:
On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
I wonder how much it's worth renaming only the file extension while
there are many places where "visibility map" and "vm" are used,
for example, log messages, function names, variables, etc.
I'd be inclined to keep calling it the visibility map (vm) even if it
also contains freeze information.
What is your main worry about changing the name of this map, is it
about more code churn or is it about that we might introduce new issues
or is it about that people are already accustomed to call this map as
visibility map?
-1 to rename. Visibility Map is a perfectly good name.
The name can stay the same, but specifically the file extension should
change.
It seems quite logical for understanding purposes as well. Any new
person who wants to work in this area, or is looking into it, will always
wonder why this map is named the visibility map even though it contains
information about the visibility of a page as well as its frozen state. So
even though it doesn't make any difference to the correctness of the feature
whether we retain the current name or change it to Visibility & Freeze Map
(aka vfm), I think it makes sense to change it for the sake of the
maintenance of this code.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Oct 30, 2015 at 6:03 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The v20 patch has a bug in the regression test results.
Attached is the updated v21 patch.
A couple more review comments:
------------------------------------------------------
1.
@@ -615,6 +617,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter n_dead_tuples;
PgStat_Counter changes_since_analyze;
+ int32 n_frozen_pages;
+
PgStat_Counter blocks_fetched;
PgStat_Counter blocks_hit;
As you are changing the above structure, you need to update
PGSTAT_FILE_FORMAT_ID; refer to the code below:
#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
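For illustration, the bump itself is a one-liner; the value below is arbitrary (my assumption, not part of the patch), the only requirement being that it differs from the previous one so that stats files written with the old layout are discarded:
/* bump whenever the stats file layout (e.g. PgStat_StatTabEntry) changes */
#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9E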
2. It seems that n_frozen_page is not initialized/updated properly
for toast tables:
Try with below steps:
postgres=# create table t4(c1 int, c2 text);
CREATE TABLE
postgres=# select oid, relname from pg_class where relname like '%t4%';
oid | relname
-------+---------
16390 | t4
(1 row)
postgres=# select oid, relname from pg_class where relname like '%16390%';
oid | relname
-------+----------------------
16393 | pg_toast_16390
16395 | pg_toast_16390_index
(2 rows)
postgres=# select relname,seq_scan,n_tup_ins,last_vacuum,n_frozen_page from
pg_stat_all_tables where relname like '%16390%';
relname | seq_scan | n_tup_ins | last_vacuum | n_frozen_page
----------------+----------+-----------+-------------+---------------
pg_toast_16390 | 1 | 0 | | -842150451
(1 row)
Note that I have tested the above scenario on my Windows 7 machine.
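(For what it's worth, -842150451 is 0xCDCDCDCD, the fill pattern the Windows debug heap uses for uninitialized memory, which points at the field being read before it is ever zeroed. A minimal sketch of the kind of fix needed, assuming the new counter is zeroed alongside the existing per-table counters in pgstat.c's pgstat_get_tab_entry():)
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
result->n_frozen_pages = 0;		/* the new counter must be zeroed here as well */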
3.
* visibilitymap.c
* bitmap for tracking visibility of heap tuples
I think above needs to be changed to:
bitmap for tracking visibility and frozen state of heap tuples
4.
a.
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples then we mark the buffer dirty, and write a WAL
b.
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on visibility map.
c.
* We do update relallvisible even in the corner case, since if the table
- * is all-visible we'd definitely like to know that. But clamp the value
- * to be not more than what we're setting relpages to.
+ * is all-visible we'd definitely like to know that.
+ * But clamp the value to be not more than what we're setting relpages to.
I don't think you need to change the above comments.
5.
+ * Even if scan_all is set so far, we could skip to scan some pages
+ * according by all-frozen bit of visibility amp.
/according by/according to
/amp/map
I suggest modifying the comment as below:
During full scan, we could skip some pages according to all-frozen
bit of visibility map.
Also, there is no need to start this on a new line; start from where the
previous comment line ends.
6.
/*
 * lazy_scan_heap() -- scan an open heap relation
 *
 * This routine prunes each page in the heap, which will among other
 * things truncate dead tuples to dead line pointers, defragment the
 * page, and set commit status bits (see heap_page_prune). It also builds
 * lists of dead tuples and pages with free space, calculates statistics
 * on the number of live tuples in the heap, and marks pages as
 * all-visible if appropriate.
Modify above function header as:
all-visible, all-frozen
7.
lazy_scan_heap()
{
..
if (PageIsEmpty(page))
{
empty_pages++;
freespace = PageGetHeapFreeSpace(page);
/* empty pages are always all-visible */
if (!PageIsAllVisible(page))
..
}
Don't we need to ensure that empty pages also get marked as
all-frozen?
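For example, something along these lines (just a sketch; it assumes the flags-taking visibilitymap_set() and the PageSetAllFrozen() macro introduced by this patch):
/* sketch: an empty page can be marked all-frozen as well as all-visible */
if (!PageIsAllVisible(page))
{
	PageSetAllVisible(page);
	PageSetAllFrozen(page);
	MarkBufferDirty(buf);
	visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer,
					  InvalidTransactionId,
					  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
}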
8.
lazy_scan_heap()
{
..
/*
 * As of PostgreSQL 9.2, the visibility map bit should never be set if
 * the page-level bit is clear. However, it's possible that the bit
 * got cleared after we checked it and before we took the buffer
 * content lock, so we must recheck before jumping to the conclusion
 * that something bad has happened.
 */
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
         && visibilitymap_test(onerel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE))
{
    elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
         relname, blkno);
    visibilitymap_clear(onerel, blkno, vmbuffer);
}

/*
 * It's possible for the value returned by GetOldestXmin() to move
 * backwards, so it's not wrong for us to see tuples that appear to
 * not be visible to everyone yet, while PD_ALL_VISIBLE is already
 * set. The real safe xmin value never moves backwards, but
 * GetOldestXmin() is conservative and sometimes returns a value
 * that's unnecessarily small, so if we see that contradiction it just
 * means that the tuples that we think are not visible to everyone yet
 * actually are, and the PD_ALL_VISIBLE flag is correct.
 *
 * There should never be dead tuples on a page with PD_ALL_VISIBLE
 * set, however.
 */
else if (PageIsAllVisible(page) && has_dead_tuples)
{
    elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
         relname, blkno);
    PageClearAllVisible(page);
    MarkBufferDirty(buf);
    visibilitymap_clear(onerel, blkno, vmbuffer);
}
..
}
I think both of the above cases could happen for the frozen state
as well; unless you think otherwise, we need similar handling
for the frozen bit.
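For instance, the first check might get a frozen-bit counterpart roughly like this (only a sketch; all_frozen_according_to_vm and VM_ALL_FROZEN() are assumed names, not necessarily what the patch uses):
else if (all_frozen_according_to_vm && !PageIsAllFrozen(page)
         && VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
{
    elog(WARNING, "page is not marked all-frozen but visibility map bit is set in relation \"%s\" page %u",
         relname, blkno);
    visibilitymap_clear(onerel, blkno, vmbuffer);
}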
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Oct 31, 2015 at 1:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
What is your main worry about changing the name of this map, is it
about more code churn or is it about that we might introduce new issues
or is it about that people are already accustomed to call this map as
visibility map?
My concern is mostly that I think calling it the "visibility and
freeze map" is excessively long and wordy.
One observation that someone made previously is that there is a
difference between "all-visible" and "index-only scan OK". An
all-visible page that has a HOT update is no longer all-visible (it
needs vacuuming) but an index-only scan would still be OK (because
only the non-indexed values in the tuple have changed, and every scan
can see either the old or the new tuple but not both). At
present, the index-only scan will consult the heap page anyway,
because all we know is that the page is not all-visible. But maybe in
the future somebody will decide to add a bit for that. Then we'd have
the "visibility, usable for index-only scans, and freeze map", but I
think "_vufiosfm" will not be a good choice for a file suffix.
So similarly here. The file suffix doesn't need to enumerate all the
bits that are present for each page.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Nov 3, 2015 at 5:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
My concern is mostly that I think calling it the "visibility and
freeze map" is excessively long and wordy.
I think in that case we can call it a page info map or page state map, but
I find retaining the visibility map name in this case, or for the future (if
we want to add another bit), confusing. In fact, if you find "visibility and
freeze map" excessively long, then we can change it to "page info map" or
"page state map" now as well.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Nov 2, 2015 at 10:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
In fact, if you find "visibility and freeze map" excessively long, then we
can change it to "page info map" or "page state map" now as well.
Sure. Or we could just keep calling it the visibility map, and then
everyone would know what we're talking about.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Nov 3, 2015 at 12:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think in that case we can call it a page info map or page state map, but
I find retaining the visibility map name in this case, or for the future (if
we want to add another bit), confusing.
In that case, file suffix would be "_pim" or "_psm"?
IMO, "page info map" would be better, because the bit doesn't indicate
the status of the page in real time; it's just additional information.
We would also need to switch to the new name in the source code, and rename
the source file as well.
Regards,
--
Masahiko Sawada
On Wed, Nov 4, 2015 at 4:45 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
In that case, file suffix would be "_pim" or "_psm"?
Right.
IMO, "page info map" would be better, because the bit doesn't indicate
the status of the page in real time; it's just additional information.
We would also need to switch to the new name in the source code, and rename
the source file as well.
I think so. Here I think the right thing to do is to proceed with fixing the
other issues in the patch and work on this part later; in the meantime we
might get more feedback on this part of the proposal.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hello, I had a look at the v21 patch.
Though I haven't looked at the whole of the patch, I'd like to give
you some comments, only on visibilitymap.c and a part of the
documentation.
1. Patch application
The patch command complains about offsets for heapam.c on current
master.
2. visibilitymap_test()
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE)
The old VM was a simple bitmap, so the name _test and the
function were proper, but now the bitmap is quad-state, so it'd be
better to change the function. Although it is not so expensive
to call it twice successively, I'm a bit uneasy doing
so. One possible shape would be like the following.
lazy_vacuum_page()
int vmstate = visibilitymap_get_status(rel, blkno, &vmbuffer);
if (!(vmstate & VISIBILITYMAP_ALL_VISIBLE))
...
if (all_frozen && !(vmstate & VISIBILITYMAP_ALL_FROZEN))
...
if (flags != vmstate)
visibilitymap_set(...., flags);
and defining two macros for individual tests,
#define VM_ALL_VISIBLE(r, b, v) ((vm_get_status((r), (b), (v)) & .._VISIBLE) != 0)
if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
and
if (VM_ALL_FROZEN(rel, blkno, &vmbuffer))
How about this?
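Spelled out, the two wrappers could look roughly like this (a sketch over the hypothetical visibilitymap_get_status() above):
#define VM_ALL_VISIBLE(rel, blkno, vmbuf) \
	((visibilitymap_get_status((rel), (blkno), (vmbuf)) & VISIBILITYMAP_ALL_VISIBLE) != 0)
#define VM_ALL_FROZEN(rel, blkno, vmbuf) \
	((visibilitymap_get_status((rel), (blkno), (vmbuf)) & VISIBILITYMAP_ALL_FROZEN) != 0)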
3. visibilitymap.c
- HEAPBLK_TO_MAPBIT
In visibilitymap_clear and other functions, mapBit means
mapDualBit in the patch, and mapBit always appears in the form
"mapBit * BITS_PER_HEAPBLOCK". So it'd be better to change the
definition of HEAPBLK_TO_MAPBIT so that it really calculates
the bit position within a byte.
- #define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
+ #define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
- visibilitymap_count()
The third argument all_frozen is not necessary for some
usages, so an interface like the following would be
preferable:
BlockNumber
visibilitymap_count(Relation rel, BlockNumber *all_frozen)
{
BlockNumber all_visible = 0;
...
if (all_frozen)
*all_frozen = 0;
... something like ...
- visibilitymap_set()
The check for ALL_VISIBLE is duplicated in the following
assertion.
Assert((flags & VISIBILITYMAP_ALL_VISIBLE) ||
(flags & (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN)));
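Taken literally, the non-redundant form is just "at least one of the two bits is set"; if the intent was rather "all-frozen implies all-visible", the check would look different (both only sketches):
Assert((flags & (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN)) != 0);
/* or, if setting the frozen bit must imply the visible bit: */
Assert(!(flags & VISIBILITYMAP_ALL_FROZEN) || (flags & VISIBILITYMAP_ALL_VISIBLE));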
4. documentation
- 18.11.1 Statement Behavior
A typo:
VACUUM performs *a* aggressive freezing
I am not a fluent English speaker, and such wordsmithing should be
done by someone else, but I feel "eager/greedy" is more suitable for
this meaning; nevertheless, the term "whole-table freezing" that you
wrote elsewhere in this patch would be usable.
"VACUUM performs a whole-table freezing"
All "a table scan/sweep"s and something has the similar
meaning would be better be changed to "a whole-table
freezing"
In similar manner, "tuples/rows that are marked as frozen"
could be replaced with "unfrozen tuples/rows".
- 23.1.5 Preventing Transaction ID Wraparound Failures
"The whole table is scanned only when all pages happen to
require vacuuming to remove dead row versions."
This description looks a bit off the point. "The whole table
scan" in the original description is what is triggered by
relfrozenxid, so the corresponding term in the revised description
is "the whole-table freezing", maybe:
"The whole-table freezing takes place when
<structfield>relfrozenxid</> is more than
<varname>vacuum_freeze_table_age</> transactions old or when
<command>VACUUM</>'s <literal>FREEZE</> option is used. The
whole-table freezing scans all unfrozen pages."
The last sentence might be unnecessary.
- 63.4 Visibility Map
"pages contain only tuples that are marked as frozen" would be
enough to be "pages contain only frozen tuples"
and according to the discussion upthread, we might be good to
have some desciption that the name is historically omitting
the aspect of freezemap.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Nov 4, 2015 at 12:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think so. Here I think the right thing to do is to proceed with fixing the
other issues in the patch and work on this part later; in the meantime we
might get more feedback on this part of the proposal.
Yeah, I'm going to make those changes if there is no strong objection from hackers.
Regards,
--
Masahiko Sawada
On Thu, Nov 5, 2015 at 6:03 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Thank you for reviewing the patch.
I changed the patch so that the visibility map becomes the page info
map, in both the source code and the documentation,
and fixed the review comments I received.
Attached is the v22 patch.
I think both of the above cases could happen for the frozen state
as well; unless you think otherwise, we need similar handling
for the frozen bit.
The situation where a page is all-frozen but not all-visible cannot happen,
and the visibility map bits are cleared at the same time, as are the page
flags.
So I think it's enough to handle only the all-visible situation. Am I
missing something?
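If that is the invariant we rely on, it could also be made explicit with an assertion where the page-level bits are inspected, something like (sketch only, using the PageIsAllFrozen()/PageIsAllVisible() macros from this patch):
Assert(!PageIsAllFrozen(page) || PageIsAllVisible(page));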
2. visibilitymap_test()
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE)
The old VM was a simple bitmap, so the name _test and the
function were proper, but now the bitmap is quad-state, so it'd be
better to change the function. Although it is not so expensive
to call it twice successively, I'm a bit uneasy doing
so. One possible shape would be like the following.
lazy_vacuum_page()
int vmstate = visibilitymap_get_status(rel, blkno, &vmbuffer);
if (!(vmstate & VISIBILITYMAP_ALL_VISIBLE))
...
if (all_frozen && !(vmstate & VISIBILITYMAP_ALL_FROZEN))
...
if (flags != vmstate)
visibilitymap_set(...., flags);
and defining two macros for individual tests,
#define VM_ALL_VISIBLE(r, b, v) ((vm_get_status((r), (b), (v)) & .._VISIBLE) != 0)
if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
and
if (VM_ALL_FROZEN(rel, blkno, &vmbuffer))
How about this?
I agree.
I've changed it accordingly.
Regards,
--
Masahiko Sawada
Attachments:
000_page_info_map_v22.patchtext/x-patch; charset=US-ASCII; name=000_page_info_map_v22.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..6c4b0a7 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -12,7 +12,7 @@
*/
#include "postgres.h"
-#include "access/visibilitymap.h"
+#include "access/pageinfomap.h"
#include "access/transam.h"
#include "access/xact.h"
#include "access/multixact.h"
@@ -48,7 +48,7 @@ typedef struct output_type
/*
* This function takes an already open relation and scans its pages,
- * skipping those that have the corresponding visibility map bit set.
+ * skipping those that have the corresponding page info map bit set.
* For pages we skip, we find the free space from the free space map
* and approximate tuple_len on that basis. For the others, we count
* the exact number of dead tuples etc.
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (PIM_ALL_VISIBLE(rel, blkno, &vmbuffer))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
@@ -242,7 +242,7 @@ pgstattuple_approx(PG_FUNCTION_ARGS)
/*
* We support only ordinary relations and materialised views, because we
- * depend on the visibility map and free space map for our estimates about
+ * depend on the page info map and free space map for our estimates about
* unscanned pages.
*/
if (!(rel->rd_rel->relkind == RELKIND_RELATION ||
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 97ef618..4a593ae 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1833,7 +1833,7 @@
<entry></entry>
<entry>
Number of pages that are marked all-visible in the table's
- visibility map. This is only an estimate used by the
+ page info map. This is only an estimate used by the
planner. It is updated by <command>VACUUM</command>,
<command>ANALYZE</command>, and a few DDL commands such as
<command>CREATE INDEX</command>.
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6e14851..c75a166 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5905,7 +5905,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs an eager freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -5949,7 +5949,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs an eager freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 60b9a09..0ccbbd5 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17663,7 +17663,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
<entry><type>bigint</type></entry>
<entry>
Disk space used by the specified fork (<literal>'main'</literal>,
- <literal>'fsm'</literal>, <literal>'vm'</>, or <literal>'init'</>)
+ <literal>'fsm'</literal>, <literal>'pim'</>, or <literal>'init'</>)
of the specified table or index
</entry>
</row>
@@ -17703,7 +17703,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
<entry><type>bigint</type></entry>
<entry>
Disk space used by the specified table, excluding indexes
- (but including TOAST, free space map, and visibility map)
+ (but including TOAST, free space map, and page info map)
</entry>
</row>
<row>
@@ -17750,7 +17750,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
<para>
<function>pg_table_size</> accepts the OID or name of a table and
returns the disk space needed for that table, exclusive of indexes.
- (TOAST space, free space map, and visibility map are included.)
+ (TOAST space, free space map, and page info map are included.)
</para>
<para>
@@ -17793,8 +17793,8 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
</listitem>
<listitem>
<para>
- <literal>'vm'</literal> returns the size of the Visibility Map
- (see <xref linkend="storage-vm">) associated with the relation.
+ <literal>'pim'</literal> returns the size of the Page Info Map
+ (see <xref linkend="storage-pim">) associated with the relation.
</para>
</listitem>
<listitem>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 1c09bae..5da49d5 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -611,7 +611,7 @@ amrestrpos (IndexScanDesc scan);
If the index stores the original indexed data values (and not some lossy
representation of them), it is useful to support index-only scans, in
which the index returns the actual data not just the TID of the heap tuple.
- This will only work if the visibility map shows that the TID is on an
+ This will only work if the page info map shows that the TID is on an
all-visible page; else the heap tuple must be visited anyway to check
MVCC visibility. But that is no concern of the access method's.
</para>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..3060a52 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -102,7 +102,7 @@
</listitem>
<listitem>
- <simpara>To update the visibility map, which speeds up index-only
+ <simpara>To update the page info map, which speeds up index-only
scans.</simpara>
</listitem>
@@ -345,16 +345,16 @@
</tip>
</sect2>
- <sect2 id="vacuum-for-visibility-map">
- <title>Updating The Visibility Map</title>
+ <sect2 id="vacuum-for-page-info-map">
+ <title>Updating The Page Info Map</title>
<para>
- Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
+ Vacuum maintains a <link linkend="storage-pim">page info map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and pages contain only frozen tuples.
+ This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -364,10 +364,10 @@
visibility information, a normal index scan fetches the heap tuple for each
matching index entry, to check whether it should be seen by the current
transaction. An <firstterm>index-only scan</>, on the other hand, checks
- the visibility map first. If it's known that all tuples on the page are
+ the page info map first. If it's known that all tuples on the page are
visible, the heap fetch can be skipped. This is most noticeable on
- large data sets where the visibility map can prevent disk accesses.
- The visibility map is vastly smaller than the heap, so it can easily be
+ large data sets where the page info map can prevent disk accesses.
+ The page info map is vastly smaller than the heap, so it can easily be
cached even when the heap is very large.
</para>
</sect2>
@@ -438,23 +438,22 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows.
+ To ensure all old row versions have been frozen, a scan of all unfrozen pages
+ is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: a whole-table freezing is forced if
+ the table hasn't been ensured all row versions are frozen for
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transcations.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
+ the time <command>VACUUM</> last scanned unfrozen pages.
+ If it were to go unvacuumed for longer than
that, data loss could result. To ensure that this does not happen,
autovacuum is invoked on any table that might contain unfrozen rows with
XIDs older than the age specified by the configuration parameter <xref
@@ -490,8 +489,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +525,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +553,19 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. The whole-table freezing is occuerred only when all pages happen to
+ require freezing to freeze rows. In other cases such as where
+ <structfield>relfrozenxid</> is more than <varname>vacuum_freeze_table_age</>
+ transcations old, where <command>VACUUM</>'s <literal>FREEZE</> option is used,
+ <command>VACUUM</> can skip the pages that all tuples on the page itself are
+ marked as frozen.
+ When all pages of table are eventually marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transcations started since the <command>VACUUM</> started).
+ If the advancing of <structfield>relfrozenxid</> is not happend until
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,28 +642,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ When <command>VACUUM</> scans all unfrozen pages, regardless of what causes
+ them, enable advancing the value for that table.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. A
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +743,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e64b7ef..1908a4d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1332,6 +1332,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Estimated number of rows modified since this table was last analyzed</entry>
</row>
<row>
+ <entry><structfield>n_frozen_page</></entry>
+ <entry><type>integer</></entry>
+ <entry>Number of frozen pages</entry>
+ </row>
+ <row>
<entry><structfield>last_vacuum</></entry>
<entry><type>timestamp with time zone</></entry>
<entry>Last time at which this table was manually vacuumed
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index b95cc81..70e28a7 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -32,7 +32,7 @@
single time-consistent copy of the block to be obtained.
<replaceable>fork</replaceable> should be <literal>'main'</literal> for
the main data fork, <literal>'fsm'</literal> for the free space map,
- <literal>'vm'</literal> for the visibility map, or <literal>'init'</literal>
+ <literal>'pim'</literal> for the page info map, or <literal>'init'</literal>
for the initialization fork.
</para>
</listitem>
diff --git a/doc/src/sgml/pgstattuple.sgml b/doc/src/sgml/pgstattuple.sgml
index 18d244b..b950a9c 100644
--- a/doc/src/sgml/pgstattuple.sgml
+++ b/doc/src/sgml/pgstattuple.sgml
@@ -400,7 +400,7 @@ approx_free_percent | 2.09
<para>
It does this by skipping pages that have only visible tuples
- according to the visibility map (if a page has the corresponding VM
+ according to the page info map (if a page has the corresponding PIM
bit set, then it is assumed to contain no dead tuples). For such
pages, it derives the free space value from the free space map, and
assumes that the rest of the space on the page is taken up by live
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index eb113c2..5ee8527 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -657,6 +657,12 @@ psql --username postgres --file script.sql postgres
</para>
<para>
+ Since the visibility map has been replaced by the page info map in
+ version 9.6, <application>pg_upgrade</> does not support upgrading
+ databases from 9.5 or earlier to 9.6 or later in link mode (-k).
+ </para>
+
+ <para>
All failure, rebuild, and reindex cases will be reported by
<application>pg_upgrade</> if they affect your installation;
post-upgrade scripts to rebuild tables and indexes will be
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..024951f 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -194,9 +194,9 @@ main fork), each table and index has a <firstterm>free space map</> (see <xref
linkend="storage-fsm">), which stores information about free space available in
the relation. The free space map is stored in a file named with the filenode
number plus the suffix <literal>_fsm</>. Tables also have a
-<firstterm>visibility map</>, stored in a fork with the suffix <literal>_vm</>,
-to track which pages are known to have no dead tuples. The visibility map is
-described further in <xref linkend="storage-vm">. Unlogged tables and indexes
+<firstterm>page info map</>, stored in a fork with the suffix <literal>_pim</>,
+to track which pages are known to have no dead tuples. The page info map is
+described further in <xref linkend="storage-pim">. Unlogged tables and indexes
have a third fork, known as the initialization fork, which is stored in a fork
with the suffix <literal>_init</literal> (see <xref linkend="storage-init">).
</para>
@@ -224,7 +224,7 @@ This arrangement avoids problems on platforms that have file size limitations.
(Actually, 1 GB is just the default segment size. The segment size can be
adjusted using the configuration option <option>--with-segsize</option>
when building <productname>PostgreSQL</>.)
-In principle, free space map and visibility map forks could require multiple
+In principle, free space map and page info map forks could require multiple
segments as well, though this is unlikely to happen in practice.
</para>
@@ -270,7 +270,7 @@ The <function>pg_relation_filepath()</> function shows the entire path
as a substitute for remembering many of the above rules. But keep in
mind that this function just gives the name of the first segment of the
main fork of the relation — you may need to append a segment number
-and/or <literal>_fsm</>, <literal>_vm</>, or <literal>_init</> to find all
+and/or <literal>_fsm</>, <literal>_pim</>, or <literal>_init</> to find all
the files associated with the relation.
</para>
@@ -611,30 +611,32 @@ can be used to examine the information stored in free space maps.
</sect1>
-<sect1 id="storage-vm">
+<sect1 id="storage-pim">
-<title>Visibility Map</title>
+<title>Page Info Map</title>
<indexterm>
- <primary>Visibility Map</primary>
+ <primary>Page Info Map</primary>
</indexterm>
-<indexterm><primary>VM</><see>Visibility Map</></indexterm>
+<indexterm><primary>PIM</><see>Page Info Map</></indexterm>
<para>
-Each heap relation has a Visibility Map
-(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
-Note that indexes do not have VMs.
+Each heap relation has a Page Info Map
+(PIM) to keep track of which pages contain only tuples that are known to be
+visible to all active transactions, and which pages contain only frozen tuples.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_pim</> suffix.
+For example, if the filenode of a relation is 12345, the PIM is stored in a file
+called <filename>12345_pim</>, in the same directory as the main relation file.
+Note that indexes do not have PIMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The page info map stores two bits per heap page: all-visible and all-frozen.
+A set all-visible bit means that all tuples on the page are known to be visible
+to all transactions. A set all-frozen bit means that all tuples on the page are
+completely frozen. This means that the page does not contain any tuples that
+need to be vacuumed, even when a whole-table scan would otherwise be required.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
@@ -642,7 +644,7 @@ queries using only the index tuple.
<para>
The map is conservative in the sense that we make sure that whenever a bit is
set, we know the condition is true, but if a bit is not set, it might or
-might not be true. Visibility map bits are only set by vacuum, but are
+might not be true. Page info map bits are only set by vacuum, but are
cleared by any data-modifying operations on a page.
</para>
diff --git a/src/backend/access/heap/Makefile b/src/backend/access/heap/Makefile
index b83d496..aeec6d1 100644
--- a/src/backend/access/heap/Makefile
+++ b/src/backend/access/heap/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/heap
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o
+OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o pageinfomap.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 35a2b05..f3142f7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -48,7 +48,7 @@
#include "access/transam.h"
#include "access/tuptoaster.h"
#include "access/valid.h"
-#include "access/visibilitymap.h"
+#include "access/pageinfomap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -406,12 +406,12 @@ heapgetpage(HeapScanDesc scan, BlockNumber page)
* transactions in the master might still be invisible to a read-only
* transaction in the standby. We partly handle this problem by tracking
* the minimum xmin of visible tuples as the cut-off XID while marking a
- * page all-visible on master and WAL log that along with the visibility
+ * page all-visible on master and WAL log that along with the page information
* map SET operation. In hot standby, we wait for (or abort) all
* transactions that can potentially may not see one or more tuples on the
* page. That's how index-only scans work fine in hot standby. A crucial
* difference between index-only scans and heap scans is that the
- * index-only scan completely relies on the visibility map where as heap
+ * index-only scan completely relies on the page info map where as heap
* scan looks at the page-level PD_ALL_VISIBLE flag. We are not sure if
* the page-level flag can be trusted in the same way, because it might
* get propagated somehow without being explicitly WAL-logged, e.g. via a
@@ -2375,7 +2375,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
bool all_visible_cleared = false;
/*
@@ -2393,7 +2393,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
- &vmbuffer, NULL);
+ &pimbuffer, NULL);
/*
* We're about to do the actual insert -- but check for conflict first, to
@@ -2422,9 +2422,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
{
all_visible_cleared = true;
PageClearAllVisible(BufferGetPage(buffer));
- visibilitymap_clear(relation,
+ pageinfomap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
- vmbuffer);
+ pimbuffer);
}
/*
@@ -2518,8 +2518,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
END_CRIT_SECTION();
UnlockReleaseBuffer(buffer);
- if (vmbuffer != InvalidBuffer)
- ReleaseBuffer(vmbuffer);
+ if (pimbuffer != InvalidBuffer)
+ ReleaseBuffer(pimbuffer);
/*
* If tuple is cachable, mark it for invalidation from the caches in case
@@ -2692,7 +2692,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
while (ndone < ntuples)
{
Buffer buffer;
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
@@ -2700,11 +2700,11 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
/*
* Find buffer where at least the next tuple will fit. If the page is
- * all-visible, this will also pin the requisite visibility map page.
+ * all-visible, this will also pin the requisite page info map page.
*/
buffer = RelationGetBufferForTuple(relation, heaptuples[ndone]->t_len,
InvalidBuffer, options, bistate,
- &vmbuffer, NULL);
+ &pimbuffer, NULL);
page = BufferGetPage(buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
@@ -2736,9 +2736,9 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
{
all_visible_cleared = true;
PageClearAllVisible(page);
- visibilitymap_clear(relation,
+ pageinfomap_clear(relation,
BufferGetBlockNumber(buffer),
- vmbuffer);
+ pimbuffer);
}
/*
@@ -2857,8 +2857,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
END_CRIT_SECTION();
UnlockReleaseBuffer(buffer);
- if (vmbuffer != InvalidBuffer)
- ReleaseBuffer(vmbuffer);
+ if (pimbuffer != InvalidBuffer)
+ ReleaseBuffer(pimbuffer);
ndone += nthispage;
}
@@ -2995,7 +2995,7 @@ heap_delete(Relation relation, ItemPointer tid,
Page page;
BlockNumber block;
Buffer buffer;
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
TransactionId new_xmax;
uint16 new_infomask,
new_infomask2;
@@ -3022,26 +3022,26 @@ heap_delete(Relation relation, ItemPointer tid,
page = BufferGetPage(buffer);
/*
- * Before locking the buffer, pin the visibility map page if it appears to
+ * Before locking the buffer, pin the page info map page if it appears to
* be necessary. Since we haven't got the lock yet, someone else might be
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
if (PageIsAllVisible(page))
- visibilitymap_pin(relation, block, &vmbuffer);
+ pageinfomap_pin(relation, block, &pimbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
/*
- * If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * If we didn't pin the page info map page and the page has become all
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
- if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+ if (pimbuffer == InvalidBuffer && PageIsAllVisible(page))
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- visibilitymap_pin(relation, block, &vmbuffer);
+ pageinfomap_pin(relation, block, &pimbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
@@ -3184,8 +3184,8 @@ l1:
UnlockReleaseBuffer(buffer);
if (have_tuple_lock)
UnlockTupleTuplock(relation, &(tp.t_self), LockTupleExclusive);
- if (vmbuffer != InvalidBuffer)
- ReleaseBuffer(vmbuffer);
+ if (pimbuffer != InvalidBuffer)
+ ReleaseBuffer(pimbuffer);
return result;
}
@@ -3239,8 +3239,8 @@ l1:
{
all_visible_cleared = true;
PageClearAllVisible(page);
- visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ pageinfomap_clear(relation, BufferGetBlockNumber(buffer),
+ pimbuffer);
}
/* store transaction information of xact deleting the tuple */
@@ -3320,8 +3320,8 @@ l1:
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- if (vmbuffer != InvalidBuffer)
- ReleaseBuffer(vmbuffer);
+ if (pimbuffer != InvalidBuffer)
+ ReleaseBuffer(pimbuffer);
/*
* If the tuple has toasted out-of-line attributes, we need to delete
@@ -3454,8 +3454,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
MultiXactStatus mxact_status;
Buffer buffer,
newbuf,
- vmbuffer = InvalidBuffer,
- vmbuffer_new = InvalidBuffer;
+ pimbuffer = InvalidBuffer,
+ pimbuffer_new = InvalidBuffer;
bool need_toast,
already_marked;
Size newtupsize,
@@ -3512,13 +3512,13 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
page = BufferGetPage(buffer);
/*
- * Before locking the buffer, pin the visibility map page if it appears to
+ * Before locking the buffer, pin the page info map page if it appears to
* be necessary. Since we haven't got the lock yet, someone else might be
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
if (PageIsAllVisible(page))
- visibilitymap_pin(relation, block, &vmbuffer);
+ pageinfomap_pin(relation, block, &pimbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -3800,15 +3800,15 @@ l2:
UnlockReleaseBuffer(buffer);
if (have_tuple_lock)
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
- if (vmbuffer != InvalidBuffer)
- ReleaseBuffer(vmbuffer);
+ if (pimbuffer != InvalidBuffer)
+ ReleaseBuffer(pimbuffer);
bms_free(hot_attrs);
bms_free(key_attrs);
return result;
}
/*
- * If we didn't pin the visibility map page and the page has become all
+ * If we didn't pin the page info map page and the page has become all
* visible while we were busy locking the buffer, or during some
* subsequent window during which we had it unlocked, we'll have to unlock
* and re-lock, to avoid holding the buffer lock across an I/O. That's a
@@ -3816,10 +3816,10 @@ l2:
* tuple has been locked or updated under us, but hopefully it won't
* happen very often.
*/
- if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+ if (pimbuffer == InvalidBuffer && PageIsAllVisible(page))
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- visibilitymap_pin(relation, block, &vmbuffer);
+ pageinfomap_pin(relation, block, &pimbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
goto l2;
}
@@ -3976,7 +3976,7 @@ l2:
/* Assume there's no chance to put heaptup on same page. */
newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
buffer, 0, NULL,
- &vmbuffer_new, &vmbuffer);
+ &pimbuffer_new, &pimbuffer);
}
else
{
@@ -3994,7 +3994,7 @@ l2:
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
buffer, 0, NULL,
- &vmbuffer_new, &vmbuffer);
+ &pimbuffer_new, &pimbuffer);
}
else
{
@@ -4114,15 +4114,15 @@ l2:
{
all_visible_cleared = true;
PageClearAllVisible(BufferGetPage(buffer));
- visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ pageinfomap_clear(relation, BufferGetBlockNumber(buffer),
+ pimbuffer);
}
if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
{
all_visible_cleared_new = true;
PageClearAllVisible(BufferGetPage(newbuf));
- visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
- vmbuffer_new);
+ pageinfomap_clear(relation, BufferGetBlockNumber(newbuf),
+ pimbuffer_new);
}
if (newbuf != buffer)
@@ -4176,10 +4176,10 @@ l2:
if (newbuf != buffer)
ReleaseBuffer(newbuf);
ReleaseBuffer(buffer);
- if (BufferIsValid(vmbuffer_new))
- ReleaseBuffer(vmbuffer_new);
- if (BufferIsValid(vmbuffer))
- ReleaseBuffer(vmbuffer);
+ if (BufferIsValid(pimbuffer_new))
+ ReleaseBuffer(pimbuffer_new);
+ if (BufferIsValid(pimbuffer))
+ ReleaseBuffer(pimbuffer);
/*
* Release the lmgr tuple lock, if we had it.
@@ -5074,7 +5074,7 @@ failed:
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
/*
- * Don't update the visibility map here. Locking a tuple doesn't change
+ * Don't update the page info map here. Locking a tuple doesn't change
* visibility info.
*/
@@ -7196,29 +7196,30 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
/*
* Perform XLogInsert for a heap-visible operation. 'block' is the block
- * being marked all-visible, and vm_buffer is the buffer containing the
- * corresponding visibility map block. Both should have already been modified
+ * being marked all-visible, and pim_buffer is the buffer containing the
+ * corresponding page info map block. Both should have already been modified
* and dirtied.
*
* If checksums are enabled, we also generate a full-page image of
* heap_buffer, if necessary.
*/
XLogRecPtr
-log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer pim_buffer,
+ TransactionId cutoff_xid, uint8 pimflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
uint8 flags;
Assert(BufferIsValid(heap_buffer));
- Assert(BufferIsValid(vm_buffer));
+ Assert(BufferIsValid(pim_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = pimflags;
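+	/*
+	 * The all-visible/all-frozen flags are included in the WAL record so
+	 * that heap_xlog_visible() can set the same bits on the heap page and
+	 * in the page info map during redo.
+	 */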
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
- XLogRegisterBuffer(0, vm_buffer, 0);
+ XLogRegisterBuffer(0, pim_buffer, 0);
flags = REGBUF_STANDARD;
if (!XLogHintBitIsNeeded())
@@ -7751,16 +7752,16 @@ heap_xlog_clean(XLogReaderState *record)
* Replay XLOG_HEAP2_VISIBLE record.
*
* The critical integrity requirement here is that we must never end up with
- * a situation where the visibility map bit is set, and the page-level
+ * a situation where the page info map bit is set, and the page-level
* PD_ALL_VISIBLE bit is clear. If that were to occur, then a subsequent
- * page modification would fail to clear the visibility map bit.
+ * page modification would fail to clear the page info map bit.
*/
static void
heap_xlog_visible(XLogReaderState *record)
{
XLogRecPtr lsn = record->EndRecPtr;
xl_heap_visible *xlrec = (xl_heap_visible *) XLogRecGetData(record);
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
Buffer buffer;
Page page;
RelFileNode rnode;
@@ -7784,7 +7785,7 @@ heap_xlog_visible(XLogReaderState *record)
/*
* Read the heap page, if it still exists. If the heap file has dropped or
* truncated later in recovery, we don't need to update the page, but we'd
- * better still update the visibility map.
+ * better still update the page info map.
*/
action = XLogReadBufferForRedo(record, 1, &buffer);
if (action == BLK_NEEDS_REDO)
@@ -7797,14 +7798,19 @@ heap_xlog_visible(XLogReaderState *record)
* we're not inspecting the existing page contents in any way, we
* don't care.
*
- * However, all operations that clear the visibility map bit *do* bump
+ * However, all operations that clear the page info map bit *do* bump
* the LSN, and those operations will only be replayed if the XLOG LSN
* follows the page LSN. Thus, if the page LSN has advanced past our
* XLOG record's LSN, we mustn't mark the page all-visible, because
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
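+		/*
+		 * Mirror whichever bits the original operation set; the all-frozen
+		 * bit is only ever set together with the all-visible bit.
+		 */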
+ if (xlrec->flags & PAGEINFOMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & PAGEINFOMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7820,28 +7826,28 @@ heap_xlog_visible(XLogReaderState *record)
/*
* Even if we skipped the heap page update due to the LSN interlock, it's
- * still safe to update the visibility map. Any WAL record that clears
- * the visibility map bit does so before checking the page LSN, so any
+ * still safe to update the page info map. Any WAL record that clears
+ * the page info map bit does so before checking the page LSN, so any
* bits that need to be cleared will still be cleared.
*/
if (XLogReadBufferForRedoExtended(record, 0, RBM_ZERO_ON_ERROR, false,
- &vmbuffer) == BLK_NEEDS_REDO)
+ &pimbuffer) == BLK_NEEDS_REDO)
{
- Page vmpage = BufferGetPage(vmbuffer);
+ Page pimpage = BufferGetPage(pimbuffer);
Relation reln;
/* initialize the page if it was read as zeros */
- if (PageIsNew(vmpage))
- PageInit(vmpage, BLCKSZ, 0);
+ if (PageIsNew(pimpage))
+ PageInit(pimpage, BLCKSZ, 0);
/*
- * XLogReplayBufferExtended locked the buffer. But visibilitymap_set
+ * XLogReplayBufferExtended locked the buffer. But pageinfomap_set
* will handle locking itself.
*/
- LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
+ LockBuffer(pimbuffer, BUFFER_LOCK_UNLOCK);
reln = CreateFakeRelcacheEntry(rnode);
- visibilitymap_pin(reln, blkno, &vmbuffer);
+ pageinfomap_pin(reln, blkno, &pimbuffer);
/*
* Don't set the bit if replay has already passed this point.
@@ -7854,15 +7860,15 @@ heap_xlog_visible(XLogReaderState *record)
* we did for the heap page. If this results in a dropped bit, no
* real harm is done; and the next VACUUM will fix it.
*/
- if (lsn > PageGetLSN(vmpage))
- visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ if (lsn > PageGetLSN(pimpage))
+ pageinfomap_set(reln, blkno, InvalidBuffer, lsn, pimbuffer,
+ xlrec->cutoff_xid, xlrec->flags);
- ReleaseBuffer(vmbuffer);
+ ReleaseBuffer(pimbuffer);
FreeFakeRelcacheEntry(reln);
}
- else if (BufferIsValid(vmbuffer))
- UnlockReleaseBuffer(vmbuffer);
+ else if (BufferIsValid(pimbuffer))
+ UnlockReleaseBuffer(pimbuffer);
}
/*
@@ -7965,17 +7971,17 @@ heap_xlog_delete(XLogReaderState *record)
ItemPointerSetOffsetNumber(&target_tid, xlrec->offnum);
/*
- * The visibility map may need to be fixed even if the heap page is
+ * The page info map may need to be fixed even if the heap page is
* already up-to-date.
*/
if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(target_node);
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
- visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
- ReleaseBuffer(vmbuffer);
+ pageinfomap_pin(reln, blkno, &pimbuffer);
+ pageinfomap_clear(reln, blkno, pimbuffer);
+ ReleaseBuffer(pimbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8043,17 +8049,17 @@ heap_xlog_insert(XLogReaderState *record)
ItemPointerSetOffsetNumber(&target_tid, xlrec->offnum);
/*
- * The visibility map may need to be fixed even if the heap page is
+ * The page info map may need to be fixed even if the heap page is
* already up-to-date.
*/
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(target_node);
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
- visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
- ReleaseBuffer(vmbuffer);
+ pageinfomap_pin(reln, blkno, &pimbuffer);
+ pageinfomap_clear(reln, blkno, pimbuffer);
+ ReleaseBuffer(pimbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8163,17 +8169,17 @@ heap_xlog_multi_insert(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
/*
- * The visibility map may need to be fixed even if the heap page is
+ * The page info map may need to be fixed even if the heap page is
* already up-to-date.
*/
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(rnode);
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
- visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
- ReleaseBuffer(vmbuffer);
+ pageinfomap_pin(reln, blkno, &pimbuffer);
+ pageinfomap_clear(reln, blkno, pimbuffer);
+ ReleaseBuffer(pimbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8318,17 +8324,17 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
ItemPointerSet(&newtid, newblk, xlrec->new_offnum);
/*
- * The visibility map may need to be fixed even if the heap page is
+ * The page info map may need to be fixed even if the heap page is
* already up-to-date.
*/
if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(rnode);
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
- visibilitymap_pin(reln, oldblk, &vmbuffer);
- visibilitymap_clear(reln, oldblk, vmbuffer);
- ReleaseBuffer(vmbuffer);
+ pageinfomap_pin(reln, oldblk, &pimbuffer);
+ pageinfomap_clear(reln, oldblk, pimbuffer);
+ ReleaseBuffer(pimbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8402,17 +8408,17 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
newaction = XLogReadBufferForRedo(record, 0, &nbuffer);
/*
- * The visibility map may need to be fixed even if the heap page is
+ * The page info map may need to be fixed even if the heap page is
* already up-to-date.
*/
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(rnode);
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
- visibilitymap_pin(reln, newblk, &vmbuffer);
- visibilitymap_clear(reln, newblk, vmbuffer);
- ReleaseBuffer(vmbuffer);
+ pageinfomap_pin(reln, newblk, &pimbuffer);
+ pageinfomap_clear(reln, newblk, pimbuffer);
+ ReleaseBuffer(pimbuffer);
FreeFakeRelcacheEntry(reln);
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6db73bf..8f702be 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -18,7 +18,7 @@
#include "access/heapam.h"
#include "access/hio.h"
#include "access/htup_details.h"
-#include "access/visibilitymap.h"
+#include "access/pageinfomap.h"
#include "storage/bufmgr.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
@@ -112,16 +112,16 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock,
/*
* For each heap page which is all-visible, acquire a pin on the appropriate
- * visibility map page, if we haven't already got one.
+ * page info map page, if we haven't already got one.
*
* buffer2 may be InvalidBuffer, if only one buffer is involved. buffer1
* must not be InvalidBuffer. If both buffers are specified, buffer1 must
* be less than buffer2.
*/
static void
-GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
+GetPageInfoMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
BlockNumber block1, BlockNumber block2,
- Buffer *vmbuffer1, Buffer *vmbuffer2)
+ Buffer *pimbuffer1, Buffer *pimbuffer2)
{
bool need_to_pin_buffer1;
bool need_to_pin_buffer2;
@@ -133,10 +133,10 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
{
/* Figure out which pins we need but don't have. */
need_to_pin_buffer1 = PageIsAllVisible(BufferGetPage(buffer1))
- && !visibilitymap_pin_ok(block1, *vmbuffer1);
+ && !pageinfomap_pin_ok(block1, *pimbuffer1);
need_to_pin_buffer2 = buffer2 != InvalidBuffer
&& PageIsAllVisible(BufferGetPage(buffer2))
- && !visibilitymap_pin_ok(block2, *vmbuffer2);
+ && !pageinfomap_pin_ok(block2, *pimbuffer2);
if (!need_to_pin_buffer1 && !need_to_pin_buffer2)
return;
@@ -147,9 +147,9 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
/* Get pins. */
if (need_to_pin_buffer1)
- visibilitymap_pin(relation, block1, vmbuffer1);
+ pageinfomap_pin(relation, block1, pimbuffer1);
if (need_to_pin_buffer2)
- visibilitymap_pin(relation, block2, vmbuffer2);
+ pageinfomap_pin(relation, block2, pimbuffer2);
/* Relock buffers. */
LockBuffer(buffer1, BUFFER_LOCK_EXCLUSIVE);
@@ -192,7 +192,7 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
* happen if space is freed in that page after heap_update finds there's not
* enough there). In that case, the page will be pinned and locked only once.
*
- * For the vmbuffer and vmbuffer_other arguments, we avoid deadlock by
+ * For the pimbuffer and pimbuffer_other arguments, we avoid deadlock by
* locking them only after locking the corresponding heap page, and taking
* no further lwlocks while they are locked.
*
@@ -228,7 +228,7 @@ Buffer
RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other)
+ Buffer *pimbuffer, Buffer *pimbuffer_other)
{
bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
Buffer buffer = InvalidBuffer;
@@ -316,7 +316,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
* the possibility they are the same block.
*
* If the page-level all-visible flag is set, caller will need to
- * clear both that and the corresponding visibility map bit. However,
+ * clear both that and the corresponding page info map bit. However,
* by the time we return, we'll have x-locked the buffer, and we don't
* want to do any I/O while in that state. So we check the bit here
* before taking the lock, and pin the page if it appears necessary.
@@ -328,7 +328,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
/* easy case */
buffer = ReadBufferBI(relation, targetBlock, bistate);
if (PageIsAllVisible(BufferGetPage(buffer)))
- visibilitymap_pin(relation, targetBlock, vmbuffer);
+ pageinfomap_pin(relation, targetBlock, pimbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
else if (otherBlock == targetBlock)
@@ -336,7 +336,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
/* also easy case */
buffer = otherBuffer;
if (PageIsAllVisible(BufferGetPage(buffer)))
- visibilitymap_pin(relation, targetBlock, vmbuffer);
+ pageinfomap_pin(relation, targetBlock, pimbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
else if (otherBlock < targetBlock)
@@ -344,7 +344,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
/* lock other buffer first */
buffer = ReadBuffer(relation, targetBlock);
if (PageIsAllVisible(BufferGetPage(buffer)))
- visibilitymap_pin(relation, targetBlock, vmbuffer);
+ pageinfomap_pin(relation, targetBlock, pimbuffer);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
@@ -353,7 +353,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
/* lock target buffer first */
buffer = ReadBuffer(relation, targetBlock);
if (PageIsAllVisible(BufferGetPage(buffer)))
- visibilitymap_pin(relation, targetBlock, vmbuffer);
+ pageinfomap_pin(relation, targetBlock, pimbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
}
@@ -374,19 +374,19 @@ RelationGetBufferForTuple(Relation relation, Size len,
* caller passed us the right page anyway.
*
* Note also that it's possible that by the time we get the pin and
- * retake the buffer locks, the visibility map bit will have been
+ * retake the buffer locks, the page info map bit will have been
* cleared by some other backend anyway. In that case, we'll have
* done a bit of extra work for no gain, but there's no real harm
* done.
*/
if (otherBuffer == InvalidBuffer || buffer <= otherBuffer)
- GetVisibilityMapPins(relation, buffer, otherBuffer,
- targetBlock, otherBlock, vmbuffer,
- vmbuffer_other);
+ GetPageInfoMapPins(relation, buffer, otherBuffer,
+ targetBlock, otherBlock, pimbuffer,
+ pimbuffer_other);
else
- GetVisibilityMapPins(relation, otherBuffer, buffer,
- otherBlock, targetBlock, vmbuffer_other,
- vmbuffer);
+ GetPageInfoMapPins(relation, otherBuffer, buffer,
+ otherBlock, targetBlock, pimbuffer_other,
+ pimbuffer);
/*
* Now we can check to see if there's enough free space here. If so,
diff --git a/src/backend/access/heap/pageinfomap.c b/src/backend/access/heap/pageinfomap.c
new file mode 100644
index 0000000..6cea796
--- /dev/null
+++ b/src/backend/access/heap/pageinfomap.c
@@ -0,0 +1,676 @@
+/*-------------------------------------------------------------------------
+ *
+ * pageinfomap.c
+ * bitmap for tracking visibility and freeze status of heap tuples
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/heap/pageinfomap.c
+ *
+ * INTERFACE ROUTINES
+ * pageinfomap_clear - clear a bit in the page info map
+ * pageinfomap_pin - pin a map page for setting a bit
+ * pageinfomap_pin_ok - check whether correct map page is already pinned
+ * pageinfomap_set - set a bit in a previously pinned page
+ * pageinfomap_get_status - get status of bits
+ * pageinfomap_count - count number of bits set in page info map
+ * pageinfomap_truncate - truncate the page info map
+ *
+ * NOTES
+ *
+ * The page info map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * when a whole-table-scanning vacuum (e.g. an anti-wraparound vacuum) is required.
+ * The all-frozen bit must be set only when the page is already all-visible.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
+ *
+ * Clearing a page info map bit is not separately WAL-logged. The callers
+ * must make sure that whenever a bit is cleared, the bit is cleared on WAL
+ * replay of the updating operation as well. The all-frozen bit must be
+ * cleared together with the all-visible bit.
+ *
+ * When we *set* a page info map bit during VACUUM, we must write WAL. This may
+ * seem counterintuitive, since the bit is basically a hint: if it is clear,
+ * it may still be the case that every tuple on the page is all-visible or
+ * all-frozen; we just don't know that for certain. The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding page info map
+ * bit. If a crash occurs after the page info map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the page info map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
+ *
+ * VACUUM will normally skip pages for which the all-visible bit is set;
+ * such pages can't contain any dead tuples and therefore don't need vacuuming.
+ * The all-frozen bit indicates that all tuples on the corresponding page have
+ * been completely frozen, so the page info map is also consulted by
+ * anti-wraparound vacuum, which can skip such pages even when freezing is required.
+ *
+ * LOCKING
+ *
+ * In heapam.c, whenever a page is modified so that not all tuples on the
+ * page are visible to everyone anymore, the corresponding bit in the
+ * page info map is cleared. In order to be crash-safe, we need to do this
+ * while still holding a lock on the heap page and in the same critical
+ * section that logs the page modification. However, we don't want to hold
+ * the buffer lock over any I/O that may be required to read in the page info
+ * map page. To avoid this, we examine the heap page before locking it;
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * page info map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the page info map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * page info map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
+ *
+ * To set a bit, you need to hold a lock on the heap page. That prevents
+ * the race condition where VACUUM sees that all tuples on the page are
+ * visible to everyone, but another backend modifies the page before VACUUM
+ * sets the bit in the page info map.
+ *
+ * When a bit is set, the LSN of the page info map page is updated to make
+ * sure that the page info map update doesn't get written to disk before the
+ * WAL record of the changes that made it possible to set the bit is flushed.
+ * But when a bit is cleared, we don't have to do that because it's always
+ * safe to clear a bit in the map from correctness point of view.
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam_xlog.h"
+#include "access/pageinfomap.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/lmgr.h"
+#include "storage/smgr.h"
+#include "utils/inval.h"
+
+
+/*#define TRACE_PAGEINFOMAP */
+
+/*
+ * Size of the bitmap on each page info map page, in bytes. There's no
+ * extra headers, so the whole page minus the standard page header is
+ * used for the bitmap.
+ */
+#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
+
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
+
+/* Number of heap blocks we can represent in one byte. */
+#define HEAPBLOCKS_PER_BYTE 4
+
+/* Number of heap blocks we can represent in one page info map page. */
+#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
+
+/* Mapping from heap block number to the right bit in the page info map */
+#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
+#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
+#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
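+
+/*
+ * A worked example of the mapping, assuming the default BLCKSZ of 8192
+ * (so MAPSIZE = 8168 and HEAPBLOCKS_PER_PAGE = 32672): heap block 100000
+ * is covered by map block 100000 / 32672 = 3, map byte
+ * (100000 % 32672) / 4 = 496 within that page, and map bit
+ * (100000 % 4) * 2 = 0, i.e. the low two bits of that byte hold its
+ * all-visible and all-frozen flags.
+ */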
+
+/* tables for fast counting of set bits for visible and freeze */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
+};
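+
+/*
+ * In each map byte the all-visible bits occupy the even bit positions
+ * (0, 2, 4, 6) and the all-frozen bits the odd positions (1, 3, 5, 7),
+ * assuming PAGEINFOMAP_ALL_VISIBLE is 0x01 and PAGEINFOMAP_ALL_FROZEN is
+ * 0x02.  For example, byte value 0x05 (binary 00000101) means the first
+ * two heap blocks covered by that byte are all-visible but not all-frozen,
+ * so number_of_ones_for_visible[0x05] is 2 and
+ * number_of_ones_for_frozen[0x05] is 0.
+ */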
+
+/* prototypes for internal routines */
+static Buffer pim_readbuf(Relation rel, BlockNumber blkno, bool extend);
+static void pim_extend(Relation rel, BlockNumber npimblocks);
+
+
+/*
+ * pageinfomap_clear - clear all bits in page info map
+ *
+ * You must pass a buffer containing the correct map page to this function.
+ * Call pageinfomap_pin first to pin the right one. This function doesn't do
+ * any I/O.
+ */
+void
+pageinfomap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
+{
+ BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+ int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+ int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+ uint8 mask = PAGEINFOMAP_ALL_FLAGS << mapBit;
+ char *map;
+
+#ifdef TRACE_PAGEINFOMAP
+ elog(DEBUG1, "pim_clear %s block %d", RelationGetRelationName(rel), heapBlk);
+#endif
+
+ if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
+ elog(ERROR, "wrong buffer passed to pageinfomap_clear");
+
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ map = PageGetContents(BufferGetPage(buf));
+
+ if (map[mapByte] & mask)
+ {
+ map[mapByte] &= ~mask;
+
+ MarkBufferDirty(buf);
+ }
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+}
+
+/*
+ * pageinfomap_pin - pin a map page for setting a bit
+ *
+ * Setting a bit in the page info map is a two-phase operation. First, call
+ * pageinfomap_pin, to pin the page info map page containing the bit for
+ * the heap page. Because that can require I/O to read the map page, you
+ * shouldn't hold a lock on the heap page while doing that. Then, call
+ * pageinfomap_set to actually set the bit.
+ *
+ * On entry, *buf should be InvalidBuffer or a valid buffer returned by
+ * an earlier call to pageinfomap_pin or pageinfomap_get_status on the same
+ * relation. On return, *buf is a valid buffer with the map page containing
+ * the bit for heapBlk.
+ *
+ * If the page doesn't exist in the map file yet, it is extended.
+ */
+void
+pageinfomap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
+{
+ BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+
+ /* Reuse the old pinned buffer if possible */
+ if (BufferIsValid(*buf))
+ {
+ if (BufferGetBlockNumber(*buf) == mapBlock)
+ return;
+
+ ReleaseBuffer(*buf);
+ }
+ *buf = pim_readbuf(rel, mapBlock, true);
+}
+
+/*
+ * pageinfomap_pin_ok - do we already have the correct page pinned?
+ *
+ * On entry, buf should be InvalidBuffer or a valid buffer returned by
+ * an earlier call to pageinfomap_pin or pageinfomap_get_status on the same
+ * relation. The return value indicates whether the buffer covers the
+ * given heapBlk.
+ */
+bool
+pageinfomap_pin_ok(BlockNumber heapBlk, Buffer buf)
+{
+ BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+
+ return BufferIsValid(buf) && BufferGetBlockNumber(buf) == mapBlock;
+}
+
+/*
+ * pageinfomap_set - set bit(s) on a previously pinned page
+ *
+ * recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
+ * or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
+ * one provided; in normal running, we generate a new XLOG record and set the
+ * page LSN to that value. cutoff_xid is the largest xmin on the page being
+ * marked all-visible; it is needed for Hot Standby, and can be
+ * InvalidTransactionId if the page contains no tuples.
+ *
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and the flags indicating which bits to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
+ *
+ * You must pass a buffer containing the correct map page to this function.
+ * Call pageinfomap_pin first to pin the right one. This function doesn't do
+ * any I/O.
+ */
+void
+pageinfomap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
+ XLogRecPtr recptr, Buffer pimBuf, TransactionId cutoff_xid,
+ uint8 flags)
+{
+ BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+ uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+ uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+ Page page;
+ char *map;
+
+#ifdef TRACE_PAGEINFOMAP
+ elog(DEBUG1, "pim_set %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
+#endif
+
+ Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
+ Assert(InRecovery || BufferIsValid(heapBuf));
+ Assert(flags & PAGEINFOMAP_ALL_FLAGS);
+
+ /* Check that we have the right heap page pinned, if present */
+ if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
+ elog(ERROR, "wrong heap buffer passed to pageinfomap_set");
+
+ /* Check that we have the right PIM page pinned */
+ if (!BufferIsValid(pimBuf) || BufferGetBlockNumber(pimBuf) != mapBlock)
+ elog(ERROR, "wrong PIM buffer passed to pageinfomap_set");
+
+ page = BufferGetPage(pimBuf);
+ map = PageGetContents(page);
+ LockBuffer(pimBuf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (flags != ((map[mapByte] >> mapBit) & flags))
+ {
+ START_CRIT_SECTION();
+
+ map[mapByte] |= (flags << mapBit);
+ MarkBufferDirty(pimBuf);
+
+ if (RelationNeedsWAL(rel))
+ {
+ if (XLogRecPtrIsInvalid(recptr))
+ {
+ Assert(!InRecovery);
+ recptr = log_heap_visible(rel->rd_node, heapBuf, pimBuf,
+ cutoff_xid, flags);
+
+ /*
+ * If data checksums are enabled (or wal_log_hints=on), we
+ * need to protect the heap page from being torn.
+ */
+ if (XLogHintBitIsNeeded())
+ {
+ Page heapPage = BufferGetPage(heapBuf);
+
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(((flags & PAGEINFOMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
+ ((flags & PAGEINFOMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
+ PageSetLSN(heapPage, recptr);
+ }
+ }
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+ }
+
+ LockBuffer(pimBuf, BUFFER_LOCK_UNLOCK);
+}
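+
+/*
+ * A sketch of the expected calling pattern for setting a bit (the callers
+ * are the sites that previously called visibilitymap_set, e.g. lazy vacuum;
+ * variable names below are placeholders):
+ *
+ *		pageinfomap_pin(rel, blkno, &pimbuf);		(may do I/O)
+ *		LockBuffer(heapbuf, BUFFER_LOCK_EXCLUSIVE);
+ *		PageSetAllVisible(BufferGetPage(heapbuf));
+ *		MarkBufferDirty(heapbuf);
+ *		pageinfomap_set(rel, blkno, heapbuf, InvalidXLogRecPtr,
+ *						pimbuf, visibility_cutoff_xid,
+ *						PAGEINFOMAP_ALL_VISIBLE);
+ *		LockBuffer(heapbuf, BUFFER_LOCK_UNLOCK);
+ */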
+
+/*
+ * pageinfomap_get_status - get status of bits
+ *
+ * Are all tuples on heapBlk visible to all transactions, or all frozen,
+ * according to the page info map?
+ *
+ * On entry, *buf should be InvalidBuffer or a valid buffer returned by an
+ * earlier call to pageinfomap_pin or pageinfomap_get_status on the same
+ * relation. On return, *buf is a valid buffer with the map page containing
+ * the bit for heapBlk, or InvalidBuffer. The caller is responsible for
+ * releasing *buf after it's done testing and setting bits. The result is a
+ * bitmask of PAGEINFOMAP_ALL_VISIBLE and PAGEINFOMAP_ALL_FROZEN.
+ *
+ * NOTE: This function is typically called without a lock on the heap page,
+ * so somebody else could change the bit just after we look at it. In fact,
+ * since we don't lock the page info map page either, it's even possible that
+ * someone else could have changed the bit just before we look at it, but yet
+ * we might see the old value. It is the caller's responsibility to deal with
+ * all concurrency issues!
+ */
+uint8
+pageinfomap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
+{
+ BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+ uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+ uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+ char *map;
+
+#ifdef TRACE_PAGEINFOMAP
+ elog(DEBUG1, "pim_get_status %s, block %d", RelationGetRelationName(rel), heapBlk);
+#endif
+
+ /* Reuse the old pinned buffer if possible */
+ if (BufferIsValid(*buf))
+ {
+ if (BufferGetBlockNumber(*buf) != mapBlock)
+ {
+ ReleaseBuffer(*buf);
+ *buf = InvalidBuffer;
+ }
+ }
+
+ if (!BufferIsValid(*buf))
+ {
+ *buf = pim_readbuf(rel, mapBlock, false);
+ if (!BufferIsValid(*buf))
+ return false;
+ }
+
+ map = PageGetContents(BufferGetPage(*buf));
+
+ /*
+ * Reading the two bits is atomic, since they live in the same byte. There
+ * could be memory-ordering effects here, but for performance reasons we
+ * make it the caller's job to worry about that.
+ */
+ return ((map[mapByte] >> mapBit) & PAGEINFOMAP_ALL_FLAGS);
+}
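+
+/*
+ * For example (a sketch), a caller deciding whether an anti-wraparound
+ * vacuum can skip a page might do:
+ *
+ *		uint8	status = pageinfomap_get_status(rel, blkno, &pimbuf);
+ *
+ *		if (status & PAGEINFOMAP_ALL_FROZEN)
+ *			... the page needs neither vacuuming nor freezing ...
+ *
+ * remembering that the bit is only a hint and may be cleared concurrently.
+ */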
+
+/*
+ * pageinfomap_count - count number of bits set in page info map
+ *
+ * Note: we ignore the possibility of race conditions when the table is being
+ * extended concurrently with the call. New pages added to the table aren't
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * If all_frozen is not NULL, the number of pages with the all-frozen bit set
+ * is returned through it as well.
+ */
+BlockNumber
+pageinfomap_count(Relation rel, BlockNumber *all_frozen)
+{
+ BlockNumber mapBlock;
+ BlockNumber all_visible = 0;
+
+ if (all_frozen)
+ *all_frozen = 0;
+
+ for (mapBlock = 0;; mapBlock++)
+ {
+ Buffer mapBuffer;
+ unsigned char *map;
+ int i;
+
+ /*
+ * Read till we fall off the end of the map. We assume that any extra
+ * bytes in the last page are zeroed, so we don't bother excluding
+ * them from the count.
+ */
+ mapBuffer = pim_readbuf(rel, mapBlock, false);
+ if (!BufferIsValid(mapBuffer))
+ break;
+
+ /*
+ * We choose not to lock the page, since the result is going to be
+ * immediately stale anyway if anyone is concurrently setting or
+ * clearing bits, and we only really need an approximate value.
+ */
+ map = (unsigned char *) PageGetContents(BufferGetPage(mapBuffer));
+
+ for (i = 0; i < MAPSIZE; i++)
+ {
+ all_visible += number_of_ones_for_visible[map[i]];
+ if (all_frozen)
+ *all_frozen += number_of_ones_for_frozen[map[i]];
+ }
+
+ ReleaseBuffer(mapBuffer);
+ }
+
+ return all_visible;
+}
+
+/*
+ * pageinfomap_truncate - truncate the page info map
+ *
+ * The caller must hold AccessExclusiveLock on the relation, to ensure that
+ * other backends receive the smgr invalidation event that this function sends
+ * before they access the PIM again.
+ *
+ * nheapblocks is the new size of the heap.
+ */
+void
+pageinfomap_truncate(Relation rel, BlockNumber nheapblocks)
+{
+ BlockNumber newnblocks;
+
+ /* last remaining block, byte, and bit */
+ BlockNumber truncBlock = HEAPBLK_TO_MAPBLOCK(nheapblocks);
+ uint32 truncByte = HEAPBLK_TO_MAPBYTE(nheapblocks);
+ uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
+
+#ifdef TRACE_PAGEINFOMAP
+ elog(DEBUG1, "pim_truncate %s block %d", RelationGetRelationName(rel), nheapblocks);
+#endif
+
+ RelationOpenSmgr(rel);
+
+ /*
+ * If no page info map has been created yet for this relation, there's
+ * nothing to truncate.
+ */
+ if (!smgrexists(rel->rd_smgr, PAGEINFOMAP_FORKNUM))
+ return;
+
+ /*
+ * Unless the new size is exactly at a page info map page boundary, the
+ * tail bits in the last remaining map page, representing truncated heap
+ * blocks, need to be cleared. This is not only tidy, but also necessary
+ * because we don't get a chance to clear the bits if the heap is extended
+ * again.
+ */
+ if (truncByte != 0 || truncBit != 0)
+ {
+ Buffer mapBuffer;
+ Page page;
+ char *map;
+
+ newnblocks = truncBlock + 1;
+
+ mapBuffer = pim_readbuf(rel, truncBlock, false);
+ if (!BufferIsValid(mapBuffer))
+ {
+ /* nothing to do, the file was already smaller */
+ return;
+ }
+
+ page = BufferGetPage(mapBuffer);
+ map = PageGetContents(page);
+
+ LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+ /* Clear out the unwanted bytes. */
+ MemSet(&map[truncByte + 1], 0, MAPSIZE - (truncByte + 1));
+
+ /*----
+ * Mask out the unwanted bits of the last remaining byte.
+ *
+ * ((1 << 0) - 1) = 00000000
+ * ((1 << 1) - 1) = 00000001
+ * ...
+ * ((1 << 6) - 1) = 00111111
+ * ((1 << 7) - 1) = 01111111
+ *----
+ */
+ map[truncByte] &= (1 << truncBit) - 1;
+
+ MarkBufferDirty(mapBuffer);
+ UnlockReleaseBuffer(mapBuffer);
+ }
+ else
+ newnblocks = truncBlock;
+
+ if (smgrnblocks(rel->rd_smgr, PAGEINFOMAP_FORKNUM) <= newnblocks)
+ {
+ /* nothing to do, the file was already smaller than requested size */
+ return;
+ }
+
+ /* Truncate the unused PIM pages, and send smgr inval message */
+ smgrtruncate(rel->rd_smgr, PAGEINFOMAP_FORKNUM, newnblocks);
+
+ /*
+ * We might as well update the local smgr_pim_nblocks setting. smgrtruncate
+ * sent an smgr cache inval message, which will cause other backends to
+ * invalidate their copy of smgr_pim_nblocks, and this one too at the next
+ * command boundary. But this ensures it isn't outright wrong until then.
+ */
+ if (rel->rd_smgr)
+ rel->rd_smgr->smgr_pim_nblocks = newnblocks;
+}
+
+/*
+ * Read a page info map page.
+ *
+ * If the page doesn't exist, InvalidBuffer is returned, or if 'extend' is
+ * true, the page info map file is extended.
+ */
+static Buffer
+pim_readbuf(Relation rel, BlockNumber blkno, bool extend)
+{
+ Buffer buf;
+
+ /*
+ * We might not have opened the relation at the smgr level yet, or we
+ * might have been forced to close it by a sinval message. The code below
+ * won't necessarily notice relation extension immediately when extend =
+ * false, so we rely on sinval messages to ensure that our ideas about the
+ * size of the map aren't too far out of date.
+ */
+ RelationOpenSmgr(rel);
+
+ /*
+ * If we haven't cached the size of the page info map fork yet, check it
+ * first.
+ */
+ if (rel->rd_smgr->smgr_pim_nblocks == InvalidBlockNumber)
+ {
+ if (smgrexists(rel->rd_smgr, PAGEINFOMAP_FORKNUM))
+ rel->rd_smgr->smgr_pim_nblocks = smgrnblocks(rel->rd_smgr,
+ PAGEINFOMAP_FORKNUM);
+ else
+ rel->rd_smgr->smgr_pim_nblocks = 0;
+ }
+
+ /* Handle requests beyond EOF */
+ if (blkno >= rel->rd_smgr->smgr_pim_nblocks)
+ {
+ if (extend)
+ pim_extend(rel, blkno + 1);
+ else
+ return InvalidBuffer;
+ }
+
+ /*
+ * Use ZERO_ON_ERROR mode, and initialize the page if necessary. It's
+ * always safe to clear bits, so it's better to clear corrupt pages than
+ * error out.
+ */
+ buf = ReadBufferExtended(rel, PAGEINFOMAP_FORKNUM, blkno,
+ RBM_ZERO_ON_ERROR, NULL);
+ if (PageIsNew(BufferGetPage(buf)))
+ PageInit(BufferGetPage(buf), BLCKSZ, 0);
+ return buf;
+}
+
+/*
+ * Ensure that the page info map fork is at least pim_nblocks long, extending
+ * it if necessary with zeroed pages.
+ */
+static void
+pim_extend(Relation rel, BlockNumber pim_nblocks)
+{
+ BlockNumber pim_nblocks_now;
+ Page pg;
+
+ pg = (Page) palloc(BLCKSZ);
+ PageInit(pg, BLCKSZ, 0);
+
+ /*
+ * We use the relation extension lock to lock out other backends trying to
+ * extend the page info map at the same time. It also locks out extension
+ * of the main fork, unnecessarily, but extending the page info map
+ * happens seldom enough that it doesn't seem worthwhile to have a
+ * separate lock tag type for it.
+ *
+ * Note that another backend might have extended or created the relation
+ * by the time we get the lock.
+ */
+ LockRelationForExtension(rel, ExclusiveLock);
+
+ /* Might have to re-open if a cache flush happened */
+ RelationOpenSmgr(rel);
+
+ /*
+ * Create the file first if it doesn't exist. If smgr_pim_nblocks is
+ * positive then it must exist, no need for an smgrexists call.
+ */
+ if ((rel->rd_smgr->smgr_pim_nblocks == 0 ||
+ rel->rd_smgr->smgr_pim_nblocks == InvalidBlockNumber) &&
+ !smgrexists(rel->rd_smgr, PAGEINFOMAP_FORKNUM))
+ smgrcreate(rel->rd_smgr, PAGEINFOMAP_FORKNUM, false);
+
+ pim_nblocks_now = smgrnblocks(rel->rd_smgr, PAGEINFOMAP_FORKNUM);
+
+ /* Now extend the file */
+ while (pim_nblocks_now < pim_nblocks)
+ {
+ PageSetChecksumInplace(pg, pim_nblocks_now);
+
+ smgrextend(rel->rd_smgr, PAGEINFOMAP_FORKNUM, pim_nblocks_now,
+ (char *) pg, false);
+ pim_nblocks_now++;
+ }
+
+ /*
+ * Send a shared-inval message to force other backends to close any smgr
+ * references they may have for this rel, which we are about to change.
+ * This is a useful optimization because it means that backends don't have
+ * to keep checking for creation or extension of the file, which happens
+ * infrequently.
+ */
+ CacheInvalidateSmgr(rel->rd_smgr->smgr_rnode);
+
+ /* Update local cache with the up-to-date size */
+ rel->rd_smgr->smgr_pim_nblocks = pim_nblocks_now;
+
+ UnlockRelationForExtension(rel, ExclusiveLock);
+
+ pfree(pg);
+}
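
(Aside for reviewers, not part of the patch: the truncation code above relies on the same byte/bit masking trick as the old visibility map. Below is a minimal standalone sketch of the "(1 << truncBit) - 1" mask; map_byte and the loop bounds are just illustrative values.)

#include <stdio.h>
#include <stdint.h>

/* Standalone sketch, not part of the patch: shows how the
 * "(1 << truncBit) - 1" mask keeps only the bits for heap blocks that
 * survive truncation within the last remaining map byte. */
int
main(void)
{
	uint8_t		map_byte = 0xFF;	/* pretend every bit was set */
	int			truncBit;

	for (truncBit = 0; truncBit < 8; truncBit++)
	{
		uint8_t		mask = (uint8_t) ((1 << truncBit) - 1);

		printf("truncBit=%d mask=0x%02X kept=0x%02X\n",
			   truncBit, mask, (uint8_t) (map_byte & mask));
	}
	return 0;
}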
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
deleted file mode 100644
index 7c38772..0000000
--- a/src/backend/access/heap/visibilitymap.c
+++ /dev/null
@@ -1,635 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * visibilitymap.c
- * bitmap for tracking visibility of heap tuples
- *
- * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
- *
- * IDENTIFICATION
- * src/backend/access/heap/visibilitymap.c
- *
- * INTERFACE ROUTINES
- * visibilitymap_clear - clear a bit in the visibility map
- * visibilitymap_pin - pin a map page for setting a bit
- * visibilitymap_pin_ok - check whether correct map page is already pinned
- * visibilitymap_set - set a bit in a previously pinned page
- * visibilitymap_test - test if a bit is set
- * visibilitymap_count - count number of bits set in visibility map
- * visibilitymap_truncate - truncate the visibility map
- *
- * NOTES
- *
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
- *
- * Clearing a visibility map bit is not separately WAL-logged. The callers
- * must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
- *
- * When we *set* a visibility map during VACUUM, we must write WAL. This may
- * seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
- *
- * VACUUM will normally skip pages for which the visibility map bit is set;
- * such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
- *
- * LOCKING
- *
- * In heapam.c, whenever a page is modified so that not all tuples on the
- * page are visible to everyone anymore, the corresponding bit in the
- * visibility map is cleared. In order to be crash-safe, we need to do this
- * while still holding a lock on the heap page and in the same critical
- * section that logs the page modification. However, we don't want to hold
- * the buffer lock over any I/O that may be required to read in the visibility
- * map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
- *
- * To set a bit, you need to hold a lock on the heap page. That prevents
- * the race condition where VACUUM sees that all tuples on the page are
- * visible to everyone, but another backend modifies the page before VACUUM
- * sets the bit in the visibility map.
- *
- * When a bit is set, the LSN of the visibility map page is updated to make
- * sure that the visibility map update doesn't get written to disk before the
- * WAL record of the changes that made it possible to set the bit is flushed.
- * But when a bit is cleared, we don't have to do that because it's always
- * safe to clear a bit in the map from correctness point of view.
- *
- *-------------------------------------------------------------------------
- */
-#include "postgres.h"
-
-#include "access/heapam_xlog.h"
-#include "access/visibilitymap.h"
-#include "access/xlog.h"
-#include "miscadmin.h"
-#include "storage/bufmgr.h"
-#include "storage/lmgr.h"
-#include "storage/smgr.h"
-#include "utils/inval.h"
-
-
-/*#define TRACE_VISIBILITYMAP */
-
-/*
- * Size of the bitmap on each visibility map page, in bytes. There's no
- * extra headers, so the whole page minus the standard page header is
- * used for the bitmap.
- */
-#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
-
-/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
-
-/* Number of heap blocks we can represent in one visibility map page. */
-#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
-
-/* Mapping from heap block number to the right bit in the visibility map */
-#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
-#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
-#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
-};
-
-/* prototypes for internal routines */
-static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
-static void vm_extend(Relation rel, BlockNumber nvmblocks);
-
-
-/*
- * visibilitymap_clear - clear a bit in visibility map
- *
- * You must pass a buffer containing the correct map page to this function.
- * Call visibilitymap_pin first to pin the right one. This function doesn't do
- * any I/O.
- */
-void
-visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
-{
- BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
- int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
- int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
- char *map;
-
-#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
-#endif
-
- if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
- elog(ERROR, "wrong buffer passed to visibilitymap_clear");
-
- LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- map = PageGetContents(BufferGetPage(buf));
-
- if (map[mapByte] & mask)
- {
- map[mapByte] &= ~mask;
-
- MarkBufferDirty(buf);
- }
-
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-}
-
-/*
- * visibilitymap_pin - pin a map page for setting a bit
- *
- * Setting a bit in the visibility map is a two-phase operation. First, call
- * visibilitymap_pin, to pin the visibility map page containing the bit for
- * the heap page. Because that can require I/O to read the map page, you
- * shouldn't hold a lock on the heap page while doing that. Then, call
- * visibilitymap_set to actually set the bit.
- *
- * On entry, *buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
- * relation. On return, *buf is a valid buffer with the map page containing
- * the bit for heapBlk.
- *
- * If the page doesn't exist in the map file yet, it is extended.
- */
-void
-visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
-{
- BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
-
- /* Reuse the old pinned buffer if possible */
- if (BufferIsValid(*buf))
- {
- if (BufferGetBlockNumber(*buf) == mapBlock)
- return;
-
- ReleaseBuffer(*buf);
- }
- *buf = vm_readbuf(rel, mapBlock, true);
-}
-
-/*
- * visibilitymap_pin_ok - do we already have the correct page pinned?
- *
- * On entry, buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
- * relation. The return value indicates whether the buffer covers the
- * given heapBlk.
- */
-bool
-visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
-{
- BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
-
- return BufferIsValid(buf) && BufferGetBlockNumber(buf) == mapBlock;
-}
-
-/*
- * visibilitymap_set - set a bit on a previously pinned page
- *
- * recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
- * or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
- * one provided; in normal running, we generate a new XLOG record and set the
- * page LSN to that value. cutoff_xid is the largest xmin on the page being
- * marked all-visible; it is needed for Hot Standby, and can be
- * InvalidTransactionId if the page contains no tuples.
- *
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
- *
- * You must pass a buffer containing the correct map page to this function.
- * Call visibilitymap_pin first to pin the right one. This function doesn't do
- * any I/O.
- */
-void
-visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
-{
- BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
- uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
- uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- Page page;
- char *map;
-
-#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
-#endif
-
- Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
- Assert(InRecovery || BufferIsValid(heapBuf));
-
- /* Check that we have the right heap page pinned, if present */
- if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
- elog(ERROR, "wrong heap buffer passed to visibilitymap_set");
-
- /* Check that we have the right VM page pinned */
- if (!BufferIsValid(vmBuf) || BufferGetBlockNumber(vmBuf) != mapBlock)
- elog(ERROR, "wrong VM buffer passed to visibilitymap_set");
-
- page = BufferGetPage(vmBuf);
- map = PageGetContents(page);
- LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
-
- if (!(map[mapByte] & (1 << mapBit)))
- {
- START_CRIT_SECTION();
-
- map[mapByte] |= (1 << mapBit);
- MarkBufferDirty(vmBuf);
-
- if (RelationNeedsWAL(rel))
- {
- if (XLogRecPtrIsInvalid(recptr))
- {
- Assert(!InRecovery);
- recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
-
- /*
- * If data checksums are enabled (or wal_log_hints=on), we
- * need to protect the heap page from being torn.
- */
- if (XLogHintBitIsNeeded())
- {
- Page heapPage = BufferGetPage(heapBuf);
-
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
- PageSetLSN(heapPage, recptr);
- }
- }
- PageSetLSN(page, recptr);
- }
-
- END_CRIT_SECTION();
- }
-
- LockBuffer(vmBuf, BUFFER_LOCK_UNLOCK);
-}
-
-/*
- * visibilitymap_test - test if a bit is set
- *
- * Are all tuples on heapBlk visible to all, according to the visibility map?
- *
- * On entry, *buf should be InvalidBuffer or a valid buffer returned by an
- * earlier call to visibilitymap_pin or visibilitymap_test on the same
- * relation. On return, *buf is a valid buffer with the map page containing
- * the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
- *
- * NOTE: This function is typically called without a lock on the heap page,
- * so somebody else could change the bit just after we look at it. In fact,
- * since we don't lock the visibility map page either, it's even possible that
- * someone else could have changed the bit just before we look at it, but yet
- * we might see the old value. It is the caller's responsibility to deal with
- * all concurrency issues!
- */
-bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
-{
- BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
- uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
- uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- bool result;
- char *map;
-
-#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
-#endif
-
- /* Reuse the old pinned buffer if possible */
- if (BufferIsValid(*buf))
- {
- if (BufferGetBlockNumber(*buf) != mapBlock)
- {
- ReleaseBuffer(*buf);
- *buf = InvalidBuffer;
- }
- }
-
- if (!BufferIsValid(*buf))
- {
- *buf = vm_readbuf(rel, mapBlock, false);
- if (!BufferIsValid(*buf))
- return false;
- }
-
- map = PageGetContents(BufferGetPage(*buf));
-
- /*
- * A single-bit read is atomic. There could be memory-ordering effects
- * here, but for performance reasons we make it the caller's job to worry
- * about that.
- */
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
-
- return result;
-}
-
-/*
- * visibilitymap_count - count number of bits set in visibility map
- *
- * Note: we ignore the possibility of race conditions when the table is being
- * extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
- */
-BlockNumber
-visibilitymap_count(Relation rel)
-{
- BlockNumber result = 0;
- BlockNumber mapBlock;
-
- for (mapBlock = 0;; mapBlock++)
- {
- Buffer mapBuffer;
- unsigned char *map;
- int i;
-
- /*
- * Read till we fall off the end of the map. We assume that any extra
- * bytes in the last page are zeroed, so we don't bother excluding
- * them from the count.
- */
- mapBuffer = vm_readbuf(rel, mapBlock, false);
- if (!BufferIsValid(mapBuffer))
- break;
-
- /*
- * We choose not to lock the page, since the result is going to be
- * immediately stale anyway if anyone is concurrently setting or
- * clearing bits, and we only really need an approximate value.
- */
- map = (unsigned char *) PageGetContents(BufferGetPage(mapBuffer));
-
- for (i = 0; i < MAPSIZE; i++)
- {
- result += number_of_ones[map[i]];
- }
-
- ReleaseBuffer(mapBuffer);
- }
-
- return result;
-}
-
-/*
- * visibilitymap_truncate - truncate the visibility map
- *
- * The caller must hold AccessExclusiveLock on the relation, to ensure that
- * other backends receive the smgr invalidation event that this function sends
- * before they access the VM again.
- *
- * nheapblocks is the new size of the heap.
- */
-void
-visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
-{
- BlockNumber newnblocks;
-
- /* last remaining block, byte, and bit */
- BlockNumber truncBlock = HEAPBLK_TO_MAPBLOCK(nheapblocks);
- uint32 truncByte = HEAPBLK_TO_MAPBYTE(nheapblocks);
- uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
-
-#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
-#endif
-
- RelationOpenSmgr(rel);
-
- /*
- * If no visibility map has been created yet for this relation, there's
- * nothing to truncate.
- */
- if (!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
- return;
-
- /*
- * Unless the new size is exactly at a visibility map page boundary, the
- * tail bits in the last remaining map page, representing truncated heap
- * blocks, need to be cleared. This is not only tidy, but also necessary
- * because we don't get a chance to clear the bits if the heap is extended
- * again.
- */
- if (truncByte != 0 || truncBit != 0)
- {
- Buffer mapBuffer;
- Page page;
- char *map;
-
- newnblocks = truncBlock + 1;
-
- mapBuffer = vm_readbuf(rel, truncBlock, false);
- if (!BufferIsValid(mapBuffer))
- {
- /* nothing to do, the file was already smaller */
- return;
- }
-
- page = BufferGetPage(mapBuffer);
- map = PageGetContents(page);
-
- LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
-
- /* Clear out the unwanted bytes. */
- MemSet(&map[truncByte + 1], 0, MAPSIZE - (truncByte + 1));
-
- /*----
- * Mask out the unwanted bits of the last remaining byte.
- *
- * ((1 << 0) - 1) = 00000000
- * ((1 << 1) - 1) = 00000001
- * ...
- * ((1 << 6) - 1) = 00111111
- * ((1 << 7) - 1) = 01111111
- *----
- */
- map[truncByte] &= (1 << truncBit) - 1;
-
- MarkBufferDirty(mapBuffer);
- UnlockReleaseBuffer(mapBuffer);
- }
- else
- newnblocks = truncBlock;
-
- if (smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM) <= newnblocks)
- {
- /* nothing to do, the file was already smaller than requested size */
- return;
- }
-
- /* Truncate the unused VM pages, and send smgr inval message */
- smgrtruncate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, newnblocks);
-
- /*
- * We might as well update the local smgr_vm_nblocks setting. smgrtruncate
- * sent an smgr cache inval message, which will cause other backends to
- * invalidate their copy of smgr_vm_nblocks, and this one too at the next
- * command boundary. But this ensures it isn't outright wrong until then.
- */
- if (rel->rd_smgr)
- rel->rd_smgr->smgr_vm_nblocks = newnblocks;
-}
-
-/*
- * Read a visibility map page.
- *
- * If the page doesn't exist, InvalidBuffer is returned, or if 'extend' is
- * true, the visibility map file is extended.
- */
-static Buffer
-vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
-{
- Buffer buf;
-
- /*
- * We might not have opened the relation at the smgr level yet, or we
- * might have been forced to close it by a sinval message. The code below
- * won't necessarily notice relation extension immediately when extend =
- * false, so we rely on sinval messages to ensure that our ideas about the
- * size of the map aren't too far out of date.
- */
- RelationOpenSmgr(rel);
-
- /*
- * If we haven't cached the size of the visibility map fork yet, check it
- * first.
- */
- if (rel->rd_smgr->smgr_vm_nblocks == InvalidBlockNumber)
- {
- if (smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
- rel->rd_smgr->smgr_vm_nblocks = smgrnblocks(rel->rd_smgr,
- VISIBILITYMAP_FORKNUM);
- else
- rel->rd_smgr->smgr_vm_nblocks = 0;
- }
-
- /* Handle requests beyond EOF */
- if (blkno >= rel->rd_smgr->smgr_vm_nblocks)
- {
- if (extend)
- vm_extend(rel, blkno + 1);
- else
- return InvalidBuffer;
- }
-
- /*
- * Use ZERO_ON_ERROR mode, and initialize the page if necessary. It's
- * always safe to clear bits, so it's better to clear corrupt pages than
- * error out.
- */
- buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno,
- RBM_ZERO_ON_ERROR, NULL);
- if (PageIsNew(BufferGetPage(buf)))
- PageInit(BufferGetPage(buf), BLCKSZ, 0);
- return buf;
-}
-
-/*
- * Ensure that the visibility map fork is at least vm_nblocks long, extending
- * it if necessary with zeroed pages.
- */
-static void
-vm_extend(Relation rel, BlockNumber vm_nblocks)
-{
- BlockNumber vm_nblocks_now;
- Page pg;
-
- pg = (Page) palloc(BLCKSZ);
- PageInit(pg, BLCKSZ, 0);
-
- /*
- * We use the relation extension lock to lock out other backends trying to
- * extend the visibility map at the same time. It also locks out extension
- * of the main fork, unnecessarily, but extending the visibility map
- * happens seldom enough that it doesn't seem worthwhile to have a
- * separate lock tag type for it.
- *
- * Note that another backend might have extended or created the relation
- * by the time we get the lock.
- */
- LockRelationForExtension(rel, ExclusiveLock);
-
- /* Might have to re-open if a cache flush happened */
- RelationOpenSmgr(rel);
-
- /*
- * Create the file first if it doesn't exist. If smgr_vm_nblocks is
- * positive then it must exist, no need for an smgrexists call.
- */
- if ((rel->rd_smgr->smgr_vm_nblocks == 0 ||
- rel->rd_smgr->smgr_vm_nblocks == InvalidBlockNumber) &&
- !smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
- smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, false);
-
- vm_nblocks_now = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
-
- /* Now extend the file */
- while (vm_nblocks_now < vm_nblocks)
- {
- PageSetChecksumInplace(pg, vm_nblocks_now);
-
- smgrextend(rel->rd_smgr, VISIBILITYMAP_FORKNUM, vm_nblocks_now,
- (char *) pg, false);
- vm_nblocks_now++;
- }
-
- /*
- * Send a shared-inval message to force other backends to close any smgr
- * references they may have for this rel, which we are about to change.
- * This is a useful optimization because it means that backends don't have
- * to keep checking for creation or extension of the file, which happens
- * infrequently.
- */
- CacheInvalidateSmgr(rel->rd_smgr->smgr_rnode);
-
- /* Update local cache with the up-to-date size */
- rel->rd_smgr->smgr_vm_nblocks = vm_nblocks_now;
-
- UnlockRelationForExtension(rel, ExclusiveLock);
-
- pfree(pg);
-}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..2c30126 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -27,7 +27,7 @@
#include "access/relscan.h"
#include "access/sysattr.h"
#include "access/transam.h"
-#include "access/visibilitymap.h"
+#include "access/pageinfomap.h"
#include "access/xact.h"
#include "bootstrap/bootstrap.h"
#include "catalog/binary_upgrade.h"
@@ -1813,8 +1813,8 @@ FormIndexDatum(IndexInfo *indexInfo,
* isprimary: if true, set relhaspkey true; else no change
* reltuples: if >= 0, set reltuples to this value; else no change
*
- * If reltuples >= 0, relpages and relallvisible are also updated (using
- * RelationGetNumberOfBlocks() and visibilitymap_count()).
+ * If reltuples >= 0, relpages and relallvisible are also updated (using
+ * RelationGetNumberOfBlocks() and pageinfomap_count()).
*
* NOTE: an important side-effect of this operation is that an SI invalidation
* message is sent out to all backends --- including me --- causing relcache
@@ -1921,7 +1921,7 @@ index_update_stats(Relation rel,
BlockNumber relallvisible;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ relallvisible = pageinfomap_count(rel, NULL);
else /* don't bother for indexes */
relallvisible = 0;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index d4440c9..eaf0796 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,7 +19,7 @@
#include "postgres.h"
-#include "access/visibilitymap.h"
+#include "access/pageinfomap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -237,17 +237,17 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
rel->rd_smgr->smgr_targblock = InvalidBlockNumber;
rel->rd_smgr->smgr_fsm_nblocks = InvalidBlockNumber;
- rel->rd_smgr->smgr_vm_nblocks = InvalidBlockNumber;
+ rel->rd_smgr->smgr_pim_nblocks = InvalidBlockNumber;
/* Truncate the FSM first if it exists */
fsm = smgrexists(rel->rd_smgr, FSM_FORKNUM);
if (fsm)
FreeSpaceMapTruncateRel(rel, nblocks);
- /* Truncate the visibility map too if it exists. */
- vm = smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
+ /* Truncate the page info map too if it exists. */
+ vm = smgrexists(rel->rd_smgr, PAGEINFOMAP_FORKNUM);
if (vm)
- visibilitymap_truncate(rel, nblocks);
+ pageinfomap_truncate(rel, nblocks);
/*
* We WAL-log the truncation before actually truncating, which means
@@ -278,8 +278,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
/*
* Flush, because otherwise the truncation of the main relation might
* hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
+ * or page info map. If we crashed during that window, we'd be left
+ * with a truncated heap, but the FSM or page info map would still
* contain entries for the non-existent heap pages.
*/
if (fsm || vm)
@@ -527,13 +527,13 @@ smgr_redo(XLogReaderState *record)
/* Also tell xlogutils.c about it */
XLogTruncateRelation(xlrec->rnode, MAIN_FORKNUM, xlrec->blkno);
- /* Truncate FSM and VM too */
+ /* Truncate FSM and PIM too */
rel = CreateFakeRelcacheEntry(xlrec->rnode);
if (smgrexists(reln, FSM_FORKNUM))
FreeSpaceMapTruncateRel(rel, xlrec->blkno);
- if (smgrexists(reln, VISIBILITYMAP_FORKNUM))
- visibilitymap_truncate(rel, xlrec->blkno);
+ if (smgrexists(reln, PAGEINFOMAP_FORKNUM))
+ pageinfomap_truncate(rel, xlrec->blkno);
FreeFakeRelcacheEntry(rel);
}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ccc030f..8c555eb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -444,6 +444,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(C.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ddb68ab..a341297 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -20,7 +20,7 @@
#include "access/transam.h"
#include "access/tupconvert.h"
#include "access/tuptoaster.h"
-#include "access/visibilitymap.h"
+#include "access/pageinfomap.h"
#include "access/xact.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
@@ -314,6 +314,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ BlockNumber relallvisible,
+ relallfrozen;
if (inh)
ereport(elevel,
@@ -564,6 +566,10 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
}
}
+ /* Calculate the number of all-visible and all-frozen pages */
+ if (!inh)
+ relallvisible = pageinfomap_count(onerel, &relallfrozen);
+
/*
* Update pages/tuples stats in pg_class ... but not if we're doing
* inherited stats.
@@ -572,7 +578,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ relallvisible,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -608,7 +614,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
* tracks per-table stats.
*/
if (!inh)
- pgstat_report_analyze(onerel, totalrows, totaldeadrows);
+ pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
if (!(options & VACOPT_VACUUM))
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7c4ef58..0a02a25 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -729,11 +729,11 @@ vac_estimate_reltuples(Relation relation, bool is_analyze,
* marked with xmin = our xid.
*
* In addition to fundamentally nontransactional statistics such as
- * relpages and relallvisible, we try to maintain certain lazily-updated
- * DDL flags such as relhasindex, by clearing them if no longer correct.
- * It's safe to do this in VACUUM, which can't run in parallel with
- * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
- * However, it's *not* safe to do it in an ANALYZE that's within an
+ * relpages and relallvisible, we try to maintain certain lazily-updated
+ * DDL flags such as relhasindex, by clearing them if no longer correct.
+ * It's safe to do this in VACUUM, which can't run in
+ * parallel with CREATE INDEX/RULE/TRIGGER and can't be part of a transaction
+ * block. However, it's *not* safe to do it in an ANALYZE that's within an
* outer transaction, because for example the current transaction might
* have dropped the last index; then we'd think relhasindex should be
* cleared, but if the transaction later rolls back this would be wrong.
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2429889..112ea00 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -43,7 +43,7 @@
#include "access/htup_details.h"
#include "access/multixact.h"
#include "access/transam.h"
-#include "access/visibilitymap.h"
+#include "access/pageinfomap.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
@@ -93,7 +93,7 @@
/*
* Before we consider skipping a page that's marked as clean in
- * visibility map, we must've seen at least this many clean pages.
+ * page info map, we must've seen at least this many clean pages.
*/
#define SKIP_PAGES_THRESHOLD ((BlockNumber) 32)
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber pimskipped_frozen_pages; /* # of pages skipped due to the all-frozen bit of the page info map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -146,7 +148,7 @@ static void lazy_cleanup_index(Relation indrel,
IndexBulkDeleteResult *stats,
LVRelStats *vacrelstats);
static int lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
- int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
+ int tupindex, LVRelStats *vacrelstats, Buffer *pimbuffer);
static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
static BlockNumber count_nondeletable_pages(Relation onerel,
LVRelStats *vacrelstats);
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -221,7 +224,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* We request a full scan if either the table's frozen Xid is now older
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit.
+ * mxid full-table scan limit. During a full scan, we can still skip pages
+ * according to the all-frozen bit of the page info map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +257,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->pimskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -301,10 +306,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = pageinfomap_count(onerel, &new_rel_allfrozen);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -325,7 +333,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),
onerel->rd_rel->relisshared,
new_live_tuples,
- vacrelstats->new_dead_tuples);
+ vacrelstats->new_dead_tuples,
+ new_rel_allfrozen);
/* and log the action if appropriate */
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
@@ -360,10 +369,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to pim\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->pimskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -451,7 +461,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
IndexBulkDeleteResult **indstats;
int i;
PGRUsage ru0;
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
BlockNumber next_not_all_visible_block;
bool skipping_all_visible_blocks;
xl_heap_freeze_tuple *frozen;
@@ -482,40 +492,43 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/*
* We want to skip pages that don't require vacuuming according to the
- * visibility map, but only when we can skip at least SKIP_PAGES_THRESHOLD
+ * page info map, but only when we can skip at least SKIP_PAGES_THRESHOLD
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * the page info map means that we can't update relfrozenxid, so we only
+ * want to do it if we can skip a goodly number. On the other hand, we
+ * count both how many pages we skipped according to the all-frozen bit of
+ * the page info map and how many pages we froze, so we can still update
+ * relfrozenxid if the sum of the two covers every page in the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
- * all-visible according to the visibility map, or nblocks if there's no
+ * all-visible according to the page info map, or nblocks if there's no
* such block. Also, we set up the skipping_all_visible_blocks flag,
* which is needed because we need hysteresis in the decision: once we've
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we might not actually skip any pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
- * all_visible_according_to_vm flag correctly for each page.
+ * all_visible_according_to_pim flag correctly for each page.
*
- * Note: The value returned by visibilitymap_test could be slightly
+ * Note: The value returned by pageinfomap_get_status could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*/
for (next_not_all_visible_block = 0;
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!PIM_ALL_VISIBLE(onerel, next_not_all_visible_block, &pimbuffer))
break;
vacuum_delay_point();
}
@@ -533,9 +546,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples frozen in this pass */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on the page */
+ int ntup_per_page; /* # of remaining tuples on the page */
Size freespace;
- bool all_visible_according_to_vm;
+ bool all_visible_according_to_pim;
+ bool all_frozen_according_to_pim;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -547,8 +564,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ if (!PIM_ALL_VISIBLE(onerel, next_not_all_visible_block, &pimbuffer))
break;
vacuum_delay_point();
}
@@ -562,14 +578,29 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
skipping_all_visible_blocks = true;
else
skipping_all_visible_blocks = false;
- all_visible_according_to_vm = false;
+
+ all_visible_according_to_pim = false;
+ all_frozen_according_to_pim = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
- all_visible_according_to_vm = true;
+ /*
+ * This block is at least all-visible according to the page info map.
+ * We also check whether it is all-frozen, so that we can skip
+ * vacuuming it even when scan_all is true.
+ */
+ bool all_frozen = PIM_ALL_FROZEN(onerel, blkno, &pimbuffer);
+
+ if (scan_all && all_frozen)
+ {
+ vacrelstats->pimskipped_frozen_pages++;
+ continue;
+ }
+ else if (!scan_all && skipping_all_visible_blocks)
+ continue;
+
+ all_visible_according_to_pim = true;
+ all_frozen_according_to_pim = all_frozen;
}
vacuum_delay_point();
@@ -583,14 +614,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
{
/*
* Before beginning index vacuuming, we release any pin we may
- * hold on the visibility map page. This isn't necessary for
+ * hold on the page info map page. This isn't necessary for
* correctness, but we do it anyway to avoid holding the pin
* across a lengthy, unrelated operation.
*/
- if (BufferIsValid(vmbuffer))
+ if (BufferIsValid(pimbuffer))
{
- ReleaseBuffer(vmbuffer);
- vmbuffer = InvalidBuffer;
+ ReleaseBuffer(pimbuffer);
+ pimbuffer = InvalidBuffer;
}
/* Log cleanup info before we touch indexes */
@@ -614,14 +645,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
}
/*
- * Pin the visibility map page in case we need to mark the page
+ * Pin the page info map page in case we need to mark the page
* all-visible. In most cases this will be very cheap, because we'll
* already have the correct page pinned anyway. However, it's
* possible that (a) next_not_all_visible_block is covered by a
- * different VM page than the current block or (b) we released our pin
+ * different PIM page than the current block or (b) we released our pin
* and did a cycle of index vacuuming.
*/
- visibilitymap_pin(onerel, blkno, &vmbuffer);
+ pageinfomap_pin(onerel, blkno, &pimbuffer);
buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vac_strategy);
@@ -716,7 +747,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
empty_pages++;
freespace = PageGetHeapFreeSpace(page);
- /* empty pages are always all-visible */
+ /* empty pages are always all-visible and all-frozen */
if (!PageIsAllVisible(page))
{
START_CRIT_SECTION();
@@ -739,8 +770,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
log_newpage_buffer(buf, true);
PageSetAllVisible(page);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ PageSetAllFrozen(page);
+ pageinfomap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ pimbuffer, InvalidTransactionId,
+ PAGEINFOMAP_ALL_VISIBLE | PAGEINFOMAP_ALL_FROZEN);
END_CRIT_SECTION();
}
@@ -764,6 +797,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +953,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +971,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we freeze any tuples, mark the buffer dirty, and write a WAL
+ * record recording the changes. We must log the changes to be crash-safe
+ * against future truncation of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1006,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -974,7 +1017,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
vacrelstats->num_dead_tuples > 0)
{
/* Remove tuples from heap */
- lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
+ lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &pimbuffer);
has_dead_tuples = false;
/*
@@ -988,41 +1031,61 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_pim)
+ {
+ /*
+ * It should never be the case that the page info map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the PIM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to pageinfomap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= PAGEINFOMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_pim)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= PAGEINFOMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ pageinfomap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ pimbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
- * As of PostgreSQL 9.2, the visibility map bit should never be set if
+ * As of PostgreSQL 9.2, the page info map bit should never be set if
* the page-level bit is clear. However, it's possible that the bit
* got cleared after we checked it and before we took the buffer
* content lock, so we must recheck before jumping to the conclusion
* that something bad has happened.
*/
- else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ else if (all_visible_according_to_pim && !PageIsAllVisible(page)
+ && PIM_ALL_VISIBLE(onerel, blkno, &pimbuffer))
{
- elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+ elog(WARNING, "page is not marked all-visible (and all-frozen) but page info map bit(s) is set in relation \"%s\" page %u",
relname, blkno);
- visibilitymap_clear(onerel, blkno, vmbuffer);
+ pageinfomap_clear(onerel, blkno, pimbuffer);
}
/*
@@ -1040,11 +1103,11 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
*/
else if (PageIsAllVisible(page) && has_dead_tuples)
{
- elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+ elog(WARNING, "page containing dead tuples is marked as all-visible (and all-frozen) in relation \"%s\" page %u",
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
- visibilitymap_clear(onerel, blkno, vmbuffer);
+ pageinfomap_clear(onerel, blkno, pimbuffer);
}
UnlockReleaseBuffer(buf);
@@ -1078,12 +1141,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on page info map page.
*/
- if (BufferIsValid(vmbuffer))
+ if (BufferIsValid(pimbuffer))
{
- ReleaseBuffer(vmbuffer);
- vmbuffer = InvalidBuffer;
+ ReleaseBuffer(pimbuffer);
+ pimbuffer = InvalidBuffer;
}
/* If any tuples need to be deleted, perform final vacuum cycle */
@@ -1114,6 +1177,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to page info map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to page info map",
+ "skipped %d frozen pages according to page info map",
+ vacrelstats->pimskipped_frozen_pages,
+ vacrelstats->pimskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1162,7 +1232,7 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
int tupindex;
int npages;
PGRUsage ru0;
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
pg_rusage_init(&ru0);
npages = 0;
@@ -1187,7 +1257,7 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
continue;
}
tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
- &vmbuffer);
+ &pimbuffer);
/* Now that we've compacted the page, record its available space */
page = BufferGetPage(buf);
@@ -1198,10 +1268,10 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
npages++;
}
- if (BufferIsValid(vmbuffer))
+ if (BufferIsValid(pimbuffer))
{
- ReleaseBuffer(vmbuffer);
- vmbuffer = InvalidBuffer;
+ ReleaseBuffer(pimbuffer);
+ pimbuffer = InvalidBuffer;
}
ereport(elevel,
@@ -1224,12 +1294,13 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
*/
static int
lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
- int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer)
+ int tupindex, LVRelStats *vacrelstats, Buffer *pimbuffer)
{
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1270,7 +1341,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
/*
* End critical section, so we safely can do visibility tests (which
* possibly need to perform IO and allocate memory!). If we crash now the
- * page (including the corresponding vm bit) might not be marked all
+ * page (including the corresponding pim bit) might not be marked all
* visible, but that's fine. A later vacuum will fix that.
*/
END_CRIT_SECTION();
@@ -1281,19 +1352,34 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the PIM all-visible bit.
+ * Also, if this page is all-frozen, set the PIM all-frozen bit and the page-level flag.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
- Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+ uint8 pim_status = pageinfomap_get_status(onerel, blkno, pimbuffer);
+ uint8 flags = 0;
+
+ if (!(pim_status & PAGEINFOMAP_ALL_VISIBLE))
+ flags |= PAGEINFOMAP_ALL_VISIBLE;
+
+ /* Add the PIM all-frozen bit to the flags, if needed */
+ if (all_frozen && !(pim_status & PAGEINFOMAP_ALL_FROZEN))
+ {
+ PageSetAllFrozen(page);
+ flags |= PAGEINFOMAP_ALL_FROZEN;
+ }
+
+ Assert(BufferIsValid(*pimbuffer));
+
+ if (pim_status != flags)
+ pageinfomap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *pimbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1783,10 +1869,12 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and all_frozen, which indicates whether
+ * all tuples on this page are frozen.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1795,6 +1883,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1818,11 +1907,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1859,6 +1949,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1867,6 +1961,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
@@ -1875,5 +1970,8 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
} /* scan along page */
+ if (!all_visible)
+ *all_frozen = false;
+
return all_visible;
}
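
(Aside, not part of the patch: the way heap_page_is_all_visible() folds the per-tuple checks into a page-level flag can be sketched in isolation like this; tuple_is_frozen[] is a hypothetical stand-in for HeapTupleHeaderXminFrozen().)

#include <stdbool.h>
#include <stdio.h>

/* Standalone sketch, not part of the patch: mirrors how lazy vacuum
 * derives a page-level all_frozen flag from per-tuple checks. The
 * tuple_is_frozen[] array stands in for HeapTupleHeaderXminFrozen(). */
static bool
page_all_frozen(const bool *tuple_is_frozen, int ntuples, bool all_visible)
{
	bool		all_frozen = true;
	int			i;

	for (i = 0; i < ntuples; i++)
	{
		if (!tuple_is_frozen[i])
			all_frozen = false;
	}

	/* a page that is not all-visible can never be reported all-frozen */
	if (!all_visible)
		all_frozen = false;

	return all_frozen;
}

int
main(void)
{
	bool		tuples[] = {true, true, false};

	printf("all_frozen = %d\n", page_all_frozen(tuples, 3, true));	/* prints 0 */
	return 0;
}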
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..f4cd9c6 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -25,7 +25,7 @@
#include "postgres.h"
#include "access/relscan.h"
-#include "access/visibilitymap.h"
+#include "access/pageinfomap.h"
#include "executor/execdebug.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
@@ -85,38 +85,37 @@ IndexOnlyNext(IndexOnlyScanState *node)
* which all tuples are known visible to everybody. In any case,
* we'll use the index tuple not the heap tuple as the data source.
*
- * Note on Memory Ordering Effects: visibilitymap_test does not lock
- * the visibility map buffer, and therefore the result we read here
+ * Note on Memory Ordering Effects: pageinfomap_get_status does not lock
+ * the page info map buffer, and therefore the result we read here
* could be slightly stale. However, it can't be stale enough to
* matter.
*
- * We need to detect clearing a VM bit due to an insert right away,
+ * We need to detect clearing a PIM bit due to an insert right away,
* because the tuple is present in the index page but not visible. The
* reading of the TID by this scan (using a shared lock on the index
* buffer) is serialized with the insert of the TID into the index
- * (using an exclusive lock on the index buffer). Because the VM bit
+ * (using an exclusive lock on the index buffer). Because the PIM bit
* is cleared before updating the index, and locking/unlocking of the
* index page acts as a full memory barrier, we are sure to see the
* cleared bit if we see a recently-inserted TID.
*
* Deletes do not update the index page (only VACUUM will clear out
- * the TID), so the clearing of the VM bit by a delete is not
+ * the TID), so the clearing of the PIM bit by a delete is not
* serialized with this test below, and we may see a value that is
* significantly stale. However, we don't care about the delete right
* away, because the tuple is still visible until the deleting
* transaction commits or the statement ends (if it's our
- * transaction). In either case, the lock on the VM buffer will have
+ * transaction). In either case, the lock on the PIM buffer will have
* been released (acting as a write barrier) after clearing the bit.
* And for us to have a snapshot that includes the deleting
* transaction (making the tuple invisible), we must have acquired
* ProcArrayLock after that time, acting as a read barrier.
*
* It's worth going through this complexity to avoid needing to lock
- * the VM buffer, which could cause significant contention.
+ * the PIM buffer, which could cause significant contention.
*/
- if (!visibilitymap_test(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ if (!PIM_ALL_VISIBLE(scandesc->heapRelation, ItemPointerGetBlockNumber(tid),
+ &node->ioss_PIMBuffer))
{
/*
* Rats, we have to visit the heap to check visibility.
@@ -322,11 +321,11 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
indexScanDesc = node->ioss_ScanDesc;
relation = node->ss.ss_currentRelation;
- /* Release VM buffer pin, if any. */
- if (node->ioss_VMBuffer != InvalidBuffer)
+ /* Release PIM buffer pin, if any. */
+ if (node->ioss_PIMBuffer != InvalidBuffer)
{
- ReleaseBuffer(node->ioss_VMBuffer);
- node->ioss_VMBuffer = InvalidBuffer;
+ ReleaseBuffer(node->ioss_PIMBuffer);
+ node->ioss_PIMBuffer = InvalidBuffer;
}
/*
@@ -546,7 +545,7 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
/* Set it up for index-only scan */
indexstate->ioss_ScanDesc->xs_want_itup = true;
- indexstate->ioss_VMBuffer = InvalidBuffer;
+ indexstate->ioss_PIMBuffer = InvalidBuffer;
/*
* If no run-time keys to calculate, go ahead and pass the scankeys to the
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 990486c..d27a35b 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -468,7 +468,7 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
* estimates based on the correlation squared (XXX is that appropriate?).
*
* If it's an index-only scan, then we will not need to fetch any heap
- * pages for which the visibility map shows all tuples are visible.
+ * pages for which the page info map shows all tuples are visible.
* Hence, reduce the estimated number of heap fetches accordingly.
* We use the measured fraction of the entire heap that is all-visible,
* which might not be particularly relevant to the subset of the heap
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 9442e5f..7a1565a 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -780,7 +780,7 @@ infer_collation_opclass_match(InferenceElem *elem, Relation idxRel,
* estimate_rel_size - estimate # pages and # tuples in a table or index
*
* We also estimate the fraction of the pages that are marked all-visible in
- * the visibility map, for use in estimation of index-only scans.
+ * the page info map, for use in estimation of index-only scans.
*
* If attr_widths isn't NULL, it points to the zero-index entry of the
* relation's attr_widths[] cache; we fill this in if we have need to compute
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ab018c4..ca7257a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1329,7 +1329,8 @@ pgstat_report_autovac(Oid dboid)
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgVacuum msg;
@@ -1343,6 +1344,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -1354,7 +1356,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
*/
void
pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgAnalyze msg;
@@ -1394,6 +1397,7 @@ pgstat_report_analyze(Relation rel,
msg.m_analyzetime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -3702,6 +3706,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
+ result->n_frozen_pages = 0;
result->blocks_fetched = 0;
result->blocks_hit = 0;
result->vacuum_timestamp = 0;
@@ -5069,6 +5074,7 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
if (msg->m_autovacuum)
{
@@ -5103,6 +5109,7 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
/*
* We reset changes_since_analyze to zero, forgetting any changes that
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..9e5bd46 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -167,7 +167,7 @@ smgropen(RelFileNode rnode, BackendId backend)
reln->smgr_owner = NULL;
reln->smgr_targblock = InvalidBlockNumber;
reln->smgr_fsm_nblocks = InvalidBlockNumber;
- reln->smgr_vm_nblocks = InvalidBlockNumber;
+ reln->smgr_pim_nblocks = InvalidBlockNumber;
reln->smgr_which = 0; /* we only have md.c at present */
/* mark it not open */
diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
index 5ee59d0..c2ac902 100644
--- a/src/backend/utils/adt/dbsize.c
+++ b/src/backend/utils/adt/dbsize.c
@@ -348,12 +348,12 @@ calculate_toast_table_size(Oid toastrelid)
toastRel = relation_open(toastrelid, AccessShareLock);
- /* toast heap size, including FSM and VM size */
+ /* toast heap size, including FSM and PIM size */
for (forkNum = 0; forkNum <= MAX_FORKNUM; forkNum++)
size += calculate_relation_size(&(toastRel->rd_node),
toastRel->rd_backend, forkNum);
- /* toast index size, including FSM and VM size */
+ /* toast index size, including FSM and PIM size */
indexlist = RelationGetIndexList(toastRel);
/* Size is calculated using all the indexes available */
@@ -377,7 +377,7 @@ calculate_toast_table_size(Oid toastrelid)
/*
* Calculate total on-disk size of a given table,
- * including FSM and VM, plus TOAST table if any.
+ * including FSM and PIM, plus TOAST table if any.
* Indexes other than the TOAST table's index are not included.
*
* Note that this also behaves sanely if applied to an index or toast table;
@@ -390,7 +390,7 @@ calculate_table_size(Relation rel)
ForkNumber forkNum;
/*
- * heap size, including FSM and VM
+ * heap size, including FSM and PIM
*/
for (forkNum = 0; forkNum <= MAX_FORKNUM; forkNum++)
size += calculate_relation_size(&(rel->rd_node), rel->rd_backend,
@@ -485,7 +485,7 @@ pg_indexes_size(PG_FUNCTION_ARGS)
/*
* Compute the on-disk size of all files for the relation,
- * including heap data, index data, toast data, FSM, VM.
+ * including heap data, index data, toast data, FSM, PIM.
*/
static int64
calculate_total_relation_size(Relation rel)
@@ -494,7 +494,7 @@ calculate_total_relation_size(Relation rel)
/*
* Aggregate the table size, this includes size of the heap, toast and
- * toast index with free space and visibility map
+ * toast index with free space and page info map
*/
size = calculate_table_size(rel);
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f7c9bf6..98c14f9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -46,6 +46,7 @@ extern Datum pg_stat_get_vacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autovacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_analyze_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_frozen_pages(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_calls(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_total_time(PG_FUNCTION_ARGS);
@@ -450,6 +451,21 @@ pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_frozen_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int32 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int32) (tabentry->n_frozen_pages);
+
+ PG_RETURN_INT32(result);
+}
+
+Datum
pg_stat_get_function_calls(PG_FUNCTION_ARGS)
{
Oid funcid = PG_GETARG_OID(0);
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 41d4606..2b06013 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -231,6 +231,15 @@ check_cluster_versions(void)
if (old_cluster.major_version > new_cluster.major_version)
pg_fatal("This utility cannot be used to downgrade to older major PostgreSQL versions.\n");
+ /*
+ * We can't allow upgrading in link mode from 9.5 or before to 9.6 or later,
+ * because the visibility map has been changed to the page info map in version 9.6.
+ */
+ if (user_opts.transfer_mode == TRANSFER_MODE_LINK &&
+ GET_MAJOR_VERSION(old_cluster.major_version) < 906 &&
+ GET_MAJOR_VERSION(new_cluster.major_version) >= 906)
+ pg_fatal("This utility cannot upgrade from PostgreSQL version from 9.5 or before to 9.6 or later with link mode.\n");
+
/* get old and new binary versions */
get_bin_version(&old_cluster);
get_bin_version(&new_cluster);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 79d9390..109b677 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
#include <fcntl.h>
@@ -21,6 +22,43 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static const char *rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool force);
+
+/* table for fast rewriting of a vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
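+
+/*
+ * A sketch of how the values above can be derived (the patch simply
+ * hard-codes them): bit k of the old visibility map byte moves to bit
+ * 2 * k of the new 16-bit value, and the interleaved all-frozen bits stay
+ * zero.  An equivalent computation, with old_byte standing for the byte
+ * read from the old vm fork, would be:
+ *
+ *	uint16 val = 0;
+ *	for (int k = 0; k < 8; k++)
+ *		val |= (uint16) (((old_byte >> k) & 1) << (2 * k));
+ */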
+
+/*
+ * copyOrRewriteFile()
+ * This function copies a file, or rewrites a visibility map file into a page info map file.
+ * If rewrite_vm is true, we have to rewrite the visibility map regardless of the value of pageConverter.
+ */
+const char *
+copyOrRewriteFile(pageCnvCtx *pageConverter,
+ const char *src, const char *dst, bool force, bool rewrite_vm)
+{
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, force);
+ else
+ return copyAndUpdateFile(pageConverter, src, dst, force);
+}
/*
* copyAndUpdateFile()
@@ -201,6 +239,96 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilitymap()
+ *
+ * Since an additional bit indicating that all tuples on a page are completely
+ * frozen has been added to the visibility map, the visibility map has become the page info map.
+ * Rewrite a visibility map file, adding an all-frozen bit (0) next to each existing bit.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+#define REWRITE_BUF_SIZE (50 * BLCKSZ)
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd = 0;
+ int dst_fd = 0;
+ uint16 vm_bits;
+ ssize_t nbytes;
+ char *buffer = NULL;
+ int ret = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return getErrorText(EINVAL);
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
+
+ /* Copy page header data in advance */
+ if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
+ goto err;
+
+ if (write(dst_fd, buffer, nbytes) != nbytes)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ /* perform data rewriting, i.e. read from source, write to destination */
+ while (true)
+ {
+ ssize_t nbytes = read(src_fd, buffer, REWRITE_BUF_SIZE);
+ char *cur, *end;
+
+ if (nbytes < 0)
+ {
+ ret = -1;
+ break;
+ }
+
+ if (nbytes == 0)
+ break;
+
+ cur = buffer;
+ end = buffer + nbytes;
+
+ /* Rewrite each source byte into BITS_PER_HEAPBLOCK bytes and write them to dst_fd */
+ while (end > cur)
+ {
+ /* Look up the rewritten bits for this byte in the table */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+
+ if (write(dst_fd, &vm_bits, BITS_PER_HEAPBLOCK) != BITS_PER_HEAPBLOCK)
+ {
+ ret = -1;
+ break;
+ }
+ cur++;
+ }
+ }
+
+err:
+
+ if (buffer)
+ pg_free(buffer);
+
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText(errno);
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 13aa891..95c6df1 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -112,6 +112,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The visibility map changed to the page info map in this 9.6 commit.
+ */
+#define VISIBILITY_MAP_CHANGE_TO_PAGEINFOMAP_CAT_VER 201511131
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -396,6 +400,8 @@ const pageCnvCtx *setupPageConverter(void);
typedef void *pageCnvCtx;
#endif
+const char *copyOrRewriteFile(pageCnvCtx *pageConverter, const char *src,
+ const char *dst, bool force, bool rewrite_vm);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c22df42..41d80ef 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *old_type_suffix, const char *new_type_suffix);
/*
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_need_rewrite = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,13 +181,20 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap to pageinfomap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_CHANGE_TO_PAGEINFOMAP_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CHANGE_TO_PAGEINFOMAP_CAT_VER)
+ vm_need_rewrite = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(pageConverter, &maps[mapnum], "", "");
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +202,14 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(pageConverter, &maps[mapnum], "_fsm", "_fsm");
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ {
+ if (vm_need_rewrite)
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", "_pim");
+ else
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", "_vm");
+ }
}
}
}
@@ -210,7 +223,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
*/
static void
transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+ const char *old_type_suffix, const char *new_type_suffix)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -218,6 +231,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
int fd;
int segno;
char extent_suffix[65];
+ bool rewrite_vm = false;
/*
* Now copy/link any related segments as well. Remember, PG breaks large
@@ -236,18 +250,18 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
map->old_tablespace_suffix,
map->old_db_oid,
map->old_relfilenode,
- type_suffix,
+ old_type_suffix,
extent_suffix);
snprintf(new_file, sizeof(new_file), "%s%s/%u/%u%s%s",
map->new_tablespace,
map->new_tablespace_suffix,
map->new_db_oid,
map->new_relfilenode,
- type_suffix,
+ new_type_suffix,
extent_suffix);
/* Is it an extent, fsm, or vm file? */
- if (type_suffix[0] != '\0' || segno != 0)
+ if (old_type_suffix[0] != '\0' || segno != 0)
{
/* Did file open fail? */
if ((fd = open(old_file, O_RDONLY, 0)) == -1)
@@ -276,7 +290,11 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /* Is it a vm file that needs to be rewritten? */
+ if (strcmp(old_type_suffix, "_vm") == 0 && strcmp(old_type_suffix, new_type_suffix) != 0)
+ rewrite_vm = true;
+
+ if ((msg = copyOrRewriteFile(pageConverter, old_file, new_file, true, rewrite_vm)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index aa7f399..f5d80cb 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for page info map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for page info map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 66dfef1..bac1157 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -34,7 +34,7 @@
const char *const forkNames[] = {
"main", /* MAIN_FORKNUM */
"fsm", /* FSM_FORKNUM */
- "vm", /* VISIBILITYMAP_FORKNUM */
+ "pim", /* PAGEINFOMAP_FORKNUM */
"init" /* INIT_FORKNUM */
};
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..dd8a4cc 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -312,17 +312,18 @@ typedef struct xl_heap_freeze_page
#define SizeOfHeapFreezePage (offsetof(xl_heap_freeze_page, ntuples) + sizeof(uint16))
/*
- * This is what we need to know about setting a visibility map bit
+ * This is what we need to know about setting a page info map bit
*
- * Backup blk 0: visibility map buffer
+ * Backup blk 0: page info map buffer
* Backup blk 1: heap buffer
*/
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer pim_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index b3b91e7..a200e5e 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -40,6 +40,6 @@ extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other);
+ Buffer *pimbuffer, Buffer *pimbuffer_other);
#endif /* HIO_H */
diff --git a/src/include/access/pageinfomap.h b/src/include/access/pageinfomap.h
new file mode 100644
index 0000000..da217d2
--- /dev/null
+++ b/src/include/access/pageinfomap.h
@@ -0,0 +1,46 @@
+/*-------------------------------------------------------------------------
+ *
+ * pageinfomap.h
+ * page info map interface
+ *
+ *
+ * Portions Copyright (c) 2007-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/pageinfomap.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PAGEINFOMAP_H
+#define PAGEINFOMAP_H
+
+#include "access/xlogdefs.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "utils/relcache.h"
+
+/* Flags for bit map */
+#define PAGEINFOMAP_ALL_VISIBLE 0x01
+#define PAGEINFOMAP_ALL_FROZEN 0x02
+
+#define PAGEINFOMAP_ALL_FLAGS 0x03
+
+/* Macros for pageinfomap test */
+#define PIM_ALL_VISIBLE(r, b, v) \
+ ((pageinfomap_get_status((r), (b), (v)) & PAGEINFOMAP_ALL_VISIBLE) != 0)
+#define PIM_ALL_FROZEN(r, b, v) \
+ ((pageinfomap_get_status((r), (b), (v)) & PAGEINFOMAP_ALL_FROZEN) != 0)
+
+extern void pageinfomap_clear(Relation rel, BlockNumber heapBlk,
+ Buffer vmbuf);
+extern void pageinfomap_pin(Relation rel, BlockNumber heapBlk,
+ Buffer *vmbuf);
+extern bool pageinfomap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
+extern void pageinfomap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern uint8 pageinfomap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern BlockNumber pageinfomap_count(Relation rel, BlockNumber *all_frozen);
+extern void pageinfomap_truncate(Relation rel, BlockNumber nheapblocks);
+
+#endif /* PAGEINFOMAP_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
deleted file mode 100644
index 0c0e0ef..0000000
--- a/src/include/access/visibilitymap.h
+++ /dev/null
@@ -1,33 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * visibilitymap.h
- * visibility map interface
- *
- *
- * Portions Copyright (c) 2007-2015, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
- * src/include/access/visibilitymap.h
- *
- *-------------------------------------------------------------------------
- */
-#ifndef VISIBILITYMAP_H
-#define VISIBILITYMAP_H
-
-#include "access/xlogdefs.h"
-#include "storage/block.h"
-#include "storage/buf.h"
-#include "utils/relcache.h"
-
-extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
-extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
- Buffer *vmbuf);
-extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
-extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
-extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
-
-#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index eba4150..3ff384b 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201511071
+#define CATALOG_VERSION_NO 201511131
#endif
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index d8640db..e3d9530 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2779,6 +2779,8 @@ DATA(insert OID = 3056 ( pg_stat_get_analyze_count PGNSP PGUID 12 1 0 0 0 f f f
DESCR("statistics: number of manual analyzes for a table");
DATA(insert OID = 3057 ( pg_stat_get_autoanalyze_count PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_autoanalyze_count _null_ _null_ _null_ ));
DESCR("statistics: number of auto analyzes for a table");
+DATA(insert OID = 6015 ( pg_stat_get_frozen_pages PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_frozen_pages _null_ _null_ _null_ ));
+DESCR("statistics: number of frozen pages of table");
DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 0 f f f f t t s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_idset _null_ _null_ _null_ ));
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f f f t s r 1 0 2249 "23" "{23,26,23,26,25,25,25,16,1184,1184,1184,1184,869,25,23,28,28,16,25,25,23,16,25}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,pid,usesysid,application_name,state,query,waiting,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,ssl,sslversion,sslcipher,sslbits,sslcompression,sslclientdn}" _null_ _null_ pg_stat_get_activity _null_ _null_ _null_ ));
@@ -3665,7 +3667,7 @@ DESCR("convert a long int to a human readable text using size units");
DATA(insert OID = 3166 ( pg_size_pretty PGNSP PGUID 12 1 0 0 0 f f f f t f v s 1 0 25 "1700" _null_ _null_ _null_ _null_ _null_ pg_size_pretty_numeric _null_ _null_ _null_ ));
DESCR("convert a numeric to a human readable text using size units");
DATA(insert OID = 2997 ( pg_table_size PGNSP PGUID 12 1 0 0 0 f f f f t f v s 1 0 20 "2205" _null_ _null_ _null_ _null_ _null_ pg_table_size _null_ _null_ _null_ ));
-DESCR("disk space usage for the specified table, including TOAST, free space and visibility map");
+DESCR("disk space usage for the specified table, including TOAST, free space and page info map");
DATA(insert OID = 2998 ( pg_indexes_size PGNSP PGUID 12 1 0 0 0 f f f f t f v s 1 0 20 "2205" _null_ _null_ _null_ _null_ _null_ pg_indexes_size _null_ _null_ _null_ ));
DESCR("disk space usage for all indexes attached to the specified table");
DATA(insert OID = 2999 ( pg_relation_filenode PGNSP PGUID 12 1 0 0 0 f f f f t f s s 1 0 26 "2205" _null_ _null_ _null_ _null_ _null_ pg_relation_filenode _null_ _null_ _null_ ));
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a263779..90ee722 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -26,7 +26,7 @@ typedef enum ForkNumber
InvalidForkNumber = -1,
MAIN_FORKNUM = 0,
FSM_FORKNUM,
- VISIBILITYMAP_FORKNUM,
+ PAGEINFOMAP_FORKNUM,
INIT_FORKNUM
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index eb3591a..af23b26 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1363,7 +1363,7 @@ typedef struct IndexScanState
* RuntimeContext expr context for evaling runtime Skeys
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
- * VMBuffer buffer in use for visibility map testing, if any
+ * PIMBuffer buffer in use for page info map testing, if any
* HeapFetches number of tuples we were forced to fetch from heap
* ----------------
*/
@@ -1381,7 +1381,7 @@ typedef struct IndexOnlyScanState
ExprContext *ioss_RuntimeContext;
Relation ioss_RelationDesc;
IndexScanDesc ioss_ScanDesc;
- Buffer ioss_VMBuffer;
+ Buffer ioss_PIMBuffer;
long ioss_HeapFetches;
} IndexOnlyScanState;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9ecc163..614ca5a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -355,6 +355,7 @@ typedef struct PgStat_MsgVacuum
TimestampTz m_vacuumtime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgVacuum;
@@ -372,6 +373,7 @@ typedef struct PgStat_MsgAnalyze
TimestampTz m_analyzetime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgAnalyze;
@@ -551,7 +553,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9E
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -615,6 +617,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter n_dead_tuples;
PgStat_Counter changes_since_analyze;
+ int32 n_frozen_pages;
+
PgStat_Counter blocks_fetched;
PgStat_Counter blocks_hit;
@@ -917,9 +921,11 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type t
extern void pgstat_report_autovac(Oid dboid);
extern void pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_recovery_conflict(int reason);
extern void pgstat_report_deadlock(void);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..102aa81 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -367,7 +369,12 @@ typedef PageHeaderData *PageHeader;
#define PageSetAllVisible(page) \
(((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
#define PageClearAllVisible(page) \
- (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+ (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
+
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
#define PageIsPrunable(page, oldestxmin) \
( \
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..c676694 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -54,7 +54,7 @@ typedef struct SMgrRelationData
*/
BlockNumber smgr_targblock; /* current insertion target block */
BlockNumber smgr_fsm_nblocks; /* last known size of fsm fork */
- BlockNumber smgr_vm_nblocks; /* last known size of vm fork */
+ BlockNumber smgr_pim_nblocks; /* last known size of pim fork */
/* additional public fields may someday exist here */
diff --git a/src/test/regress/expected/pageinfomap.out b/src/test/regress/expected/pageinfomap.out
new file mode 100644
index 0000000..31543ba
--- /dev/null
+++ b/src/test/regress/expected/pageinfomap.out
@@ -0,0 +1,22 @@
+--
+-- Page Info Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 45 frozen pages according to page info map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+\set VERBOSITY default
+DROP TABLE vmtest;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 80374e4..b259e65 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1683,6 +1683,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(c.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid) AS last_autovacuum,
pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
@@ -1795,6 +1796,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
@@ -1838,6 +1840,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index f5be70f..dd49786 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -127,6 +127,8 @@ SELECT count(*) FROM tenk2 WHERE unique1 = 1;
1
(1 row)
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
SELECT pg_sleep(1.0);
@@ -175,6 +177,14 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
t | t
(1 row)
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+ ?column?
+----------
+ t
+(1 row)
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
snapshot_newer
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 3987b4c..c4d0281 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -110,3 +110,6 @@ test: event_trigger
# run stats by itself because its delay may be insufficient under heavy load
test: stats
+
+# page info map and vacuum test cannot run concurrently with any test that runs SQL
+test: pageinfomap
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 379f272..69fbab1 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -160,3 +160,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: pageinfomap
diff --git a/src/test/regress/sql/pageinfomap.sql b/src/test/regress/sql/pageinfomap.sql
new file mode 100644
index 0000000..739c715
--- /dev/null
+++ b/src/test/regress/sql/pageinfomap.sql
@@ -0,0 +1,16 @@
+--
+-- Page Info Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+\set VERBOSITY default
+
+DROP TABLE vmtest;
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index cd2d592..b3c640f 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -120,6 +120,8 @@ ROLLBACK;
SELECT count(*) FROM tenk2;
-- do an indexscan
SELECT count(*) FROM tenk2 WHERE unique1 = 1;
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
@@ -145,6 +147,10 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
WHERE st.relname='tenk2' AND cl.relname='tenk2';
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
On Fri, Nov 13, 2015 at 4:48 AM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
> Thank you for reviewing the patch.
> I changed the patch so that the visibility map becomes the page info
> map, in source code and documentation.

One thing to notice is that this almost doubles the patch size, which
might make it slightly more difficult to review, but on the other hand,
if nobody opposes such a change, this seems to be the right direction.

> And fixed review comments I received.
> Attached v22 patch.

>> I think both the above cases could happen for the frozen state
>> as well; unless you think otherwise, we need similar handling
>> for the frozen bit.

> The situation where a page is all-frozen but not all-visible can't
> happen; the visibility map bits are cleared at the same time, and so
> are the page flags.
> So I think it's enough to handle only the all-visible situation. Am I
> missing something?

No, I think you are right: the information for both is cleared together,
and all-visible is a superset of all-frozen (meaning that if all-frozen
is set, then all-visible must be set), so it is sufficient to check the
visibility info in the above situation. But I feel we can update the
comment to indicate the same and add an Assert to ensure that if
all-frozen is set, all-visible must be set.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Nov 13, 2015 at 1:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Nov 13, 2015 at 4:48 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>> Thank you for reviewing the patch.
>> I changed the patch so that the visibility map becomes the page info
>> map, in source code and documentation.
>
> One thing to notice is that this almost doubles the patch size, which
> might make it slightly more difficult to review, but on the other hand,
> if nobody opposes such a change, this seems to be the right direction.

I believe that it's going in the right direction.
But I think we haven't reached consensus about this change yet, so it might go back.

>> And fixed review comments I received.
>> Attached v22 patch.
>
>>> I think both the above cases could happen for the frozen state
>>> as well; unless you think otherwise, we need similar handling
>>> for the frozen bit.
>
>> The situation where a page is all-frozen but not all-visible can't
>> happen; the visibility map bits are cleared at the same time, and so
>> are the page flags.
>> So I think it's enough to handle only the all-visible situation. Am I
>> missing something?
>
> No, I think you are right: the information for both is cleared together,
> and all-visible is a superset of all-frozen (meaning that if all-frozen
> is set, then all-visible must be set), so it is sufficient to check the
> visibility info in the above situation. But I feel we can update the
> comment to indicate the same and add an Assert to ensure that if
> all-frozen is set, all-visible must be set.

I agree.
I added an Assert() macro into lazy_scan_heap() and some comments.

Attached v23 patch.
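
For reference, the invariant check in lazy_scan_heap() can be as simple as
the following sketch (the variable names onerel, blkno and pimbuffer are
illustrative and may differ from what the v23 patch actually uses):

    /* all-frozen must never be set without all-visible */
    if (PIM_ALL_FROZEN(onerel, blkno, &pimbuffer))
        Assert(PIM_ALL_VISIBLE(onerel, blkno, &pimbuffer));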
Regards,
--
Masahiko Sawada
Attachments:
000_page_info_map_v23.patchtext/x-patch; charset=US-ASCII; name=000_page_info_map_v23.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..6c4b0a7 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -12,7 +12,7 @@
*/
#include "postgres.h"
-#include "access/visibilitymap.h"
+#include "access/pageinfomap.h"
#include "access/transam.h"
#include "access/xact.h"
#include "access/multixact.h"
@@ -48,7 +48,7 @@ typedef struct output_type
/*
* This function takes an already open relation and scans its pages,
- * skipping those that have the corresponding visibility map bit set.
+ * skipping those that have the corresponding page info map bit set.
* For pages we skip, we find the free space from the free space map
* and approximate tuple_len on that basis. For the others, we count
* the exact number of dead tuples etc.
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (PIM_ALL_VISIBLE(rel, blkno, &vmbuffer))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
@@ -242,7 +242,7 @@ pgstattuple_approx(PG_FUNCTION_ARGS)
/*
* We support only ordinary relations and materialised views, because we
- * depend on the visibility map and free space map for our estimates about
+ * depend on the page info map and free space map for our estimates about
* unscanned pages.
*/
if (!(rel->rd_rel->relkind == RELKIND_RELATION ||
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 97ef618..4a593ae 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1833,7 +1833,7 @@
<entry></entry>
<entry>
Number of pages that are marked all-visible in the table's
- visibility map. This is only an estimate used by the
+ page info map. This is only an estimate used by the
planner. It is updated by <command>VACUUM</command>,
<command>ANALYZE</command>, and a few DDL commands such as
<command>CREATE INDEX</command>.
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6e14851..c75a166 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5905,7 +5905,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -5949,7 +5949,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 60b9a09..0ccbbd5 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17663,7 +17663,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
<entry><type>bigint</type></entry>
<entry>
Disk space used by the specified fork (<literal>'main'</literal>,
- <literal>'fsm'</literal>, <literal>'vm'</>, or <literal>'init'</>)
+ <literal>'fsm'</literal>, <literal>'pim'</>, or <literal>'init'</>)
of the specified table or index
</entry>
</row>
@@ -17703,7 +17703,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
<entry><type>bigint</type></entry>
<entry>
Disk space used by the specified table, excluding indexes
- (but including TOAST, free space map, and visibility map)
+ (but including TOAST, free space map, and page info map)
</entry>
</row>
<row>
@@ -17750,7 +17750,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
<para>
<function>pg_table_size</> accepts the OID or name of a table and
returns the disk space needed for that table, exclusive of indexes.
- (TOAST space, free space map, and visibility map are included.)
+ (TOAST space, free space map, and page info map are included.)
</para>
<para>
@@ -17793,8 +17793,8 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
</listitem>
<listitem>
<para>
- <literal>'vm'</literal> returns the size of the Visibility Map
- (see <xref linkend="storage-vm">) associated with the relation.
+ <literal>'pim'</literal> returns the size of the Page Info Map
+ (see <xref linkend="storage-pim">) associated with the relation.
</para>
</listitem>
<listitem>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 1c09bae..5da49d5 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -611,7 +611,7 @@ amrestrpos (IndexScanDesc scan);
If the index stores the original indexed data values (and not some lossy
representation of them), it is useful to support index-only scans, in
which the index returns the actual data not just the TID of the heap tuple.
- This will only work if the visibility map shows that the TID is on an
+ This will only work if the page info map shows that the TID is on an
all-visible page; else the heap tuple must be visited anyway to check
MVCC visibility. But that is no concern of the access method's.
</para>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..3060a52 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -102,7 +102,7 @@
</listitem>
<listitem>
- <simpara>To update the visibility map, which speeds up index-only
+ <simpara>To update the page info map, which speeds up index-only
scans.</simpara>
</listitem>
@@ -345,16 +345,16 @@
</tip>
</sect2>
- <sect2 id="vacuum-for-visibility-map">
- <title>Updating The Visibility Map</title>
+ <sect2 id="vacuum-for-page-info-map">
+ <title>Updating The Page Info Map</title>
<para>
- Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
+ Vacuum maintains a <link linkend="storage-pim">page info map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and of which pages contain only frozen tuples.
+ This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -364,10 +364,10 @@
visibility information, a normal index scan fetches the heap tuple for each
matching index entry, to check whether it should be seen by the current
transaction. An <firstterm>index-only scan</>, on the other hand, checks
- the visibility map first. If it's known that all tuples on the page are
+ the page info map first. If it's known that all tuples on the page are
visible, the heap fetch can be skipped. This is most noticeable on
- large data sets where the visibility map can prevent disk accesses.
- The visibility map is vastly smaller than the heap, so it can easily be
+ large data sets where the page info map can prevent disk accesses.
+ The page info map is vastly smaller than the heap, so it can easily be
cached even when the heap is very large.
</para>
</sect2>
@@ -438,23 +438,22 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows.
+ To ensure all old row versions have been frozen, a scan of all unfrozen pages
+ is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: whole-table freezing is forced if
+ the table hasn't had all its row versions guaranteed frozen for
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transactions.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
+ the time <command>VACUUM</> last scanned unfrozen pages.
+ If it were to go unvacuumed for longer than
that, data loss could result. To ensure that this does not happen,
autovacuum is invoked on any table that might contain unfrozen rows with
XIDs older than the age specified by the configuration parameter <xref
@@ -490,8 +489,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +525,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +553,19 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. Whole-table freezing occurs only when all pages happen to
+ require freezing in order to freeze rows. In other cases, such as when
+ <structfield>relfrozenxid</> is more than <varname>vacuum_freeze_table_age</>
+ transactions old or when <command>VACUUM</>'s <literal>FREEZE</> option is used,
+ <command>VACUUM</> can skip the pages on which all tuples are already
+ marked as frozen.
+ When all pages of the table are eventually marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transactions started since the <command>VACUUM</> started).
+ If <structfield>relfrozenxid</> has not been advanced by the time
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,28 +642,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ <command>VACUUM</> scans of all unfrozen pages, regardless of what causes
+ them, enable advancing the value for that table.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. A
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +743,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e64b7ef..1908a4d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1332,6 +1332,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Estimated number of rows modified since this table was last analyzed</entry>
</row>
<row>
+ <entry><structfield>n_frozen_page</></entry>
+ <entry><type>integer</></entry>
+ <entry>Number of pages marked all-frozen in the page info map</entry>
+ </row>
+ <row>
<entry><structfield>last_vacuum</></entry>
<entry><type>timestamp with time zone</></entry>
<entry>Last time at which this table was manually vacuumed
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index b95cc81..70e28a7 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -32,7 +32,7 @@
single time-consistent copy of the block to be obtained.
<replaceable>fork</replaceable> should be <literal>'main'</literal> for
the main data fork, <literal>'fsm'</literal> for the free space map,
- <literal>'vm'</literal> for the visibility map, or <literal>'init'</literal>
+ <literal>'pim'</literal> for the page info map, or <literal>'init'</literal>
for the initialization fork.
</para>
</listitem>
diff --git a/doc/src/sgml/pgstattuple.sgml b/doc/src/sgml/pgstattuple.sgml
index 18d244b..b950a9c 100644
--- a/doc/src/sgml/pgstattuple.sgml
+++ b/doc/src/sgml/pgstattuple.sgml
@@ -400,7 +400,7 @@ approx_free_percent | 2.09
<para>
It does this by skipping pages that have only visible tuples
- according to the visibility map (if a page has the corresponding VM
+ according to the page info map (if a page has the corresponding all-visible
bit set, then it is assumed to contain no dead tuples). For such
pages, it derives the free space value from the free space map, and
assumes that the rest of the space on the page is taken up by live
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index eb113c2..5ee8527 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -657,6 +657,12 @@ psql --username postgres --file script.sql postgres
</para>
<para>
+ Because the visibility map has been replaced by the page info map in
+ version 9.6, <application>pg_upgrade</> does not support upgrading
+ databases from 9.5 or earlier to 9.6 or later in link mode (-k).
+ </para>
+
+ <para>
All failure, rebuild, and reindex cases will be reported by
<application>pg_upgrade</> if they affect your installation;
post-upgrade scripts to rebuild tables and indexes will be
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..024951f 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -194,9 +194,9 @@ main fork), each table and index has a <firstterm>free space map</> (see <xref
linkend="storage-fsm">), which stores information about free space available in
the relation. The free space map is stored in a file named with the filenode
number plus the suffix <literal>_fsm</>. Tables also have a
-<firstterm>visibility map</>, stored in a fork with the suffix <literal>_vm</>,
-to track which pages are known to have no dead tuples. The visibility map is
-described further in <xref linkend="storage-vm">. Unlogged tables and indexes
+<firstterm>page info map</>, stored in a fork with the suffix <literal>_pim</>,
+to track which pages are known to have no dead tuples. The page info map is
+described further in <xref linkend="storage-pim">. Unlogged tables and indexes
have a third fork, known as the initialization fork, which is stored in a fork
with the suffix <literal>_init</literal> (see <xref linkend="storage-init">).
</para>
@@ -224,7 +224,7 @@ This arrangement avoids problems on platforms that have file size limitations.
(Actually, 1 GB is just the default segment size. The segment size can be
adjusted using the configuration option <option>--with-segsize</option>
when building <productname>PostgreSQL</>.)
-In principle, free space map and visibility map forks could require multiple
+In principle, free space map and page info map forks could require multiple
segments as well, though this is unlikely to happen in practice.
</para>
@@ -270,7 +270,7 @@ The <function>pg_relation_filepath()</> function shows the entire path
as a substitute for remembering many of the above rules. But keep in
mind that this function just gives the name of the first segment of the
main fork of the relation — you may need to append a segment number
-and/or <literal>_fsm</>, <literal>_vm</>, or <literal>_init</> to find all
+and/or <literal>_fsm</>, <literal>_pim</>, or <literal>_init</> to find all
the files associated with the relation.
</para>
@@ -611,30 +611,32 @@ can be used to examine the information stored in free space maps.
</sect1>
-<sect1 id="storage-vm">
+<sect1 id="storage-pim">
-<title>Visibility Map</title>
+<title>Page Info Map</title>
<indexterm>
- <primary>Visibility Map</primary>
+ <primary>Page Info Map</primary>
</indexterm>
-<indexterm><primary>VM</><see>Visibility Map</></indexterm>
+<indexterm><primary>PIM</><see>Page Info Map</></indexterm>
<para>
-Each heap relation has a Visibility Map
-(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
-Note that indexes do not have VMs.
+Each heap relation has a Page Info Map
+(PIM) to keep track of which pages contain only tuples that are known to be
+visible to all active transactions, and which pages contain only frozen tuples.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_pim</> suffix.
+For example, if the filenode of a relation is 12345, the PIM is stored in a file
+called <filename>12345_pim</>, in the same directory as the main relation file.
+Note that indexes do not have PIMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The page info map stores two bits per heap page: an all-visible bit and an
+all-frozen bit. A set all-visible bit means that all tuples on the page are
+known to be visible to all transactions. A set all-frozen bit means that all
+tuples on the page are completely frozen, so the page can be skipped even by
+a vacuum that would otherwise need to scan the whole table. In either case
+the page does not contain any tuples that need to be vacuumed.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
@@ -642,7 +644,7 @@ queries using only the index tuple.
<para>
The map is conservative in the sense that we make sure that whenever a bit is
set, we know the condition is true, but if a bit is not set, it might or
-might not be true. Visibility map bits are only set by vacuum, but are
+might not be true. Page info map bits are only set by vacuum, but are
cleared by any data-modifying operations on a page.
</para>
diff --git a/src/backend/access/heap/Makefile b/src/backend/access/heap/Makefile
index b83d496..aeec6d1 100644
--- a/src/backend/access/heap/Makefile
+++ b/src/backend/access/heap/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/heap
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o
+OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o pageinfomap.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 35a2b05..f3142f7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -48,7 +48,7 @@
#include "access/transam.h"
#include "access/tuptoaster.h"
#include "access/valid.h"
-#include "access/visibilitymap.h"
+#include "access/pageinfomap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -406,12 +406,12 @@ heapgetpage(HeapScanDesc scan, BlockNumber page)
* transactions in the master might still be invisible to a read-only
* transaction in the standby. We partly handle this problem by tracking
* the minimum xmin of visible tuples as the cut-off XID while marking a
- * page all-visible on master and WAL log that along with the visibility
+ * page all-visible on master and WAL log that along with the page info
* map SET operation. In hot standby, we wait for (or abort) all
* transactions that can potentially may not see one or more tuples on the
* page. That's how index-only scans work fine in hot standby. A crucial
* difference between index-only scans and heap scans is that the
- * index-only scan completely relies on the visibility map where as heap
+ * index-only scan completely relies on the page info map where as heap
* scan looks at the page-level PD_ALL_VISIBLE flag. We are not sure if
* the page-level flag can be trusted in the same way, because it might
* get propagated somehow without being explicitly WAL-logged, e.g. via a
@@ -2375,7 +2375,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
TransactionId xid = GetCurrentTransactionId();
HeapTuple heaptup;
Buffer buffer;
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
bool all_visible_cleared = false;
/*
@@ -2393,7 +2393,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
- &vmbuffer, NULL);
+ &pimbuffer, NULL);
/*
* We're about to do the actual insert -- but check for conflict first, to
@@ -2422,9 +2422,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
{
all_visible_cleared = true;
PageClearAllVisible(BufferGetPage(buffer));
- visibilitymap_clear(relation,
+ pageinfomap_clear(relation,
ItemPointerGetBlockNumber(&(heaptup->t_self)),
- vmbuffer);
+ pimbuffer);
}
/*
@@ -2518,8 +2518,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
END_CRIT_SECTION();
UnlockReleaseBuffer(buffer);
- if (vmbuffer != InvalidBuffer)
- ReleaseBuffer(vmbuffer);
+ if (pimbuffer != InvalidBuffer)
+ ReleaseBuffer(pimbuffer);
/*
* If tuple is cachable, mark it for invalidation from the caches in case
@@ -2692,7 +2692,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
while (ndone < ntuples)
{
Buffer buffer;
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
bool all_visible_cleared = false;
int nthispage;
@@ -2700,11 +2700,11 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
/*
* Find buffer where at least the next tuple will fit. If the page is
- * all-visible, this will also pin the requisite visibility map page.
+ * all-visible, this will also pin the requisite page info map page.
*/
buffer = RelationGetBufferForTuple(relation, heaptuples[ndone]->t_len,
InvalidBuffer, options, bistate,
- &vmbuffer, NULL);
+ &pimbuffer, NULL);
page = BufferGetPage(buffer);
/* NO EREPORT(ERROR) from here till changes are logged */
@@ -2736,9 +2736,9 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
{
all_visible_cleared = true;
PageClearAllVisible(page);
- visibilitymap_clear(relation,
+ pageinfomap_clear(relation,
BufferGetBlockNumber(buffer),
- vmbuffer);
+ pimbuffer);
}
/*
@@ -2857,8 +2857,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
END_CRIT_SECTION();
UnlockReleaseBuffer(buffer);
- if (vmbuffer != InvalidBuffer)
- ReleaseBuffer(vmbuffer);
+ if (pimbuffer != InvalidBuffer)
+ ReleaseBuffer(pimbuffer);
ndone += nthispage;
}
@@ -2995,7 +2995,7 @@ heap_delete(Relation relation, ItemPointer tid,
Page page;
BlockNumber block;
Buffer buffer;
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
TransactionId new_xmax;
uint16 new_infomask,
new_infomask2;
@@ -3022,26 +3022,26 @@ heap_delete(Relation relation, ItemPointer tid,
page = BufferGetPage(buffer);
/*
- * Before locking the buffer, pin the visibility map page if it appears to
+ * Before locking the buffer, pin the page info map page if it appears to
* be necessary. Since we haven't got the lock yet, someone else might be
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
if (PageIsAllVisible(page))
- visibilitymap_pin(relation, block, &vmbuffer);
+ pageinfomap_pin(relation, block, &pimbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
/*
- * If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * If we didn't pin the page info map page and the page has become all
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
- if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+ if (pimbuffer == InvalidBuffer && PageIsAllVisible(page))
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- visibilitymap_pin(relation, block, &vmbuffer);
+ pageinfomap_pin(relation, block, &pimbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
@@ -3184,8 +3184,8 @@ l1:
UnlockReleaseBuffer(buffer);
if (have_tuple_lock)
UnlockTupleTuplock(relation, &(tp.t_self), LockTupleExclusive);
- if (vmbuffer != InvalidBuffer)
- ReleaseBuffer(vmbuffer);
+ if (pimbuffer != InvalidBuffer)
+ ReleaseBuffer(pimbuffer);
return result;
}
@@ -3239,8 +3239,8 @@ l1:
{
all_visible_cleared = true;
PageClearAllVisible(page);
- visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ pageinfomap_clear(relation, BufferGetBlockNumber(buffer),
+ pimbuffer);
}
/* store transaction information of xact deleting the tuple */
@@ -3320,8 +3320,8 @@ l1:
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- if (vmbuffer != InvalidBuffer)
- ReleaseBuffer(vmbuffer);
+ if (pimbuffer != InvalidBuffer)
+ ReleaseBuffer(pimbuffer);
/*
* If the tuple has toasted out-of-line attributes, we need to delete
@@ -3454,8 +3454,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
MultiXactStatus mxact_status;
Buffer buffer,
newbuf,
- vmbuffer = InvalidBuffer,
- vmbuffer_new = InvalidBuffer;
+ pimbuffer = InvalidBuffer,
+ pimbuffer_new = InvalidBuffer;
bool need_toast,
already_marked;
Size newtupsize,
@@ -3512,13 +3512,13 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
page = BufferGetPage(buffer);
/*
- * Before locking the buffer, pin the visibility map page if it appears to
+ * Before locking the buffer, pin the page info map page if it appears to
* be necessary. Since we haven't got the lock yet, someone else might be
* in the middle of changing this, so we'll need to recheck after we have
* the lock.
*/
if (PageIsAllVisible(page))
- visibilitymap_pin(relation, block, &vmbuffer);
+ pageinfomap_pin(relation, block, &pimbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -3800,15 +3800,15 @@ l2:
UnlockReleaseBuffer(buffer);
if (have_tuple_lock)
UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
- if (vmbuffer != InvalidBuffer)
- ReleaseBuffer(vmbuffer);
+ if (pimbuffer != InvalidBuffer)
+ ReleaseBuffer(pimbuffer);
bms_free(hot_attrs);
bms_free(key_attrs);
return result;
}
/*
- * If we didn't pin the visibility map page and the page has become all
+ * If we didn't pin the page info map page and the page has become all
* visible while we were busy locking the buffer, or during some
* subsequent window during which we had it unlocked, we'll have to unlock
* and re-lock, to avoid holding the buffer lock across an I/O. That's a
@@ -3816,10 +3816,10 @@ l2:
* tuple has been locked or updated under us, but hopefully it won't
* happen very often.
*/
- if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+ if (pimbuffer == InvalidBuffer && PageIsAllVisible(page))
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- visibilitymap_pin(relation, block, &vmbuffer);
+ pageinfomap_pin(relation, block, &pimbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
goto l2;
}
@@ -3976,7 +3976,7 @@ l2:
/* Assume there's no chance to put heaptup on same page. */
newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
buffer, 0, NULL,
- &vmbuffer_new, &vmbuffer);
+ &pimbuffer_new, &pimbuffer);
}
else
{
@@ -3994,7 +3994,7 @@ l2:
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
buffer, 0, NULL,
- &vmbuffer_new, &vmbuffer);
+ &pimbuffer_new, &pimbuffer);
}
else
{
@@ -4114,15 +4114,15 @@ l2:
{
all_visible_cleared = true;
PageClearAllVisible(BufferGetPage(buffer));
- visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
- vmbuffer);
+ pageinfomap_clear(relation, BufferGetBlockNumber(buffer),
+ pimbuffer);
}
if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
{
all_visible_cleared_new = true;
PageClearAllVisible(BufferGetPage(newbuf));
- visibilitymap_clear(relation, BufferGetBlockNumber(newbuf),
- vmbuffer_new);
+ pageinfomap_clear(relation, BufferGetBlockNumber(newbuf),
+ pimbuffer_new);
}
if (newbuf != buffer)
@@ -4176,10 +4176,10 @@ l2:
if (newbuf != buffer)
ReleaseBuffer(newbuf);
ReleaseBuffer(buffer);
- if (BufferIsValid(vmbuffer_new))
- ReleaseBuffer(vmbuffer_new);
- if (BufferIsValid(vmbuffer))
- ReleaseBuffer(vmbuffer);
+ if (BufferIsValid(pimbuffer_new))
+ ReleaseBuffer(pimbuffer_new);
+ if (BufferIsValid(pimbuffer))
+ ReleaseBuffer(pimbuffer);
/*
* Release the lmgr tuple lock, if we had it.
@@ -5074,7 +5074,7 @@ failed:
LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
/*
- * Don't update the visibility map here. Locking a tuple doesn't change
+ * Don't update the page info map here. Locking a tuple doesn't change
* visibility info.
*/
@@ -7196,29 +7196,30 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
/*
* Perform XLogInsert for a heap-visible operation. 'block' is the block
- * being marked all-visible, and vm_buffer is the buffer containing the
- * corresponding visibility map block. Both should have already been modified
+ * being marked all-visible, and pim_buffer is the buffer containing the
+ * corresponding page info map block. Both should have already been modified
* and dirtied.
*
* If checksums are enabled, we also generate a full-page image of
* heap_buffer, if necessary.
*/
XLogRecPtr
-log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer pim_buffer,
+ TransactionId cutoff_xid, uint8 pimflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
uint8 flags;
Assert(BufferIsValid(heap_buffer));
- Assert(BufferIsValid(vm_buffer));
+ Assert(BufferIsValid(pim_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = pimflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
- XLogRegisterBuffer(0, vm_buffer, 0);
+ XLogRegisterBuffer(0, pim_buffer, 0);
flags = REGBUF_STANDARD;
if (!XLogHintBitIsNeeded())
@@ -7751,16 +7752,16 @@ heap_xlog_clean(XLogReaderState *record)
* Replay XLOG_HEAP2_VISIBLE record.
*
* The critical integrity requirement here is that we must never end up with
- * a situation where the visibility map bit is set, and the page-level
+ * a situation where the page info map bit is set, and the page-level
* PD_ALL_VISIBLE bit is clear. If that were to occur, then a subsequent
- * page modification would fail to clear the visibility map bit.
+ * page modification would fail to clear the page info map bit.
*/
static void
heap_xlog_visible(XLogReaderState *record)
{
XLogRecPtr lsn = record->EndRecPtr;
xl_heap_visible *xlrec = (xl_heap_visible *) XLogRecGetData(record);
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
Buffer buffer;
Page page;
RelFileNode rnode;
@@ -7784,7 +7785,7 @@ heap_xlog_visible(XLogReaderState *record)
/*
* Read the heap page, if it still exists. If the heap file has dropped or
* truncated later in recovery, we don't need to update the page, but we'd
- * better still update the visibility map.
+ * better still update the page info map.
*/
action = XLogReadBufferForRedo(record, 1, &buffer);
if (action == BLK_NEEDS_REDO)
@@ -7797,14 +7798,19 @@ heap_xlog_visible(XLogReaderState *record)
* we're not inspecting the existing page contents in any way, we
* don't care.
*
- * However, all operations that clear the visibility map bit *do* bump
+ * However, all operations that clear the page info map bit *do* bump
* the LSN, and those operations will only be replayed if the XLOG LSN
* follows the page LSN. Thus, if the page LSN has advanced past our
* XLOG record's LSN, we mustn't mark the page all-visible, because
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & PAGEINFOMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & PAGEINFOMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7820,28 +7826,28 @@ heap_xlog_visible(XLogReaderState *record)
/*
* Even if we skipped the heap page update due to the LSN interlock, it's
- * still safe to update the visibility map. Any WAL record that clears
- * the visibility map bit does so before checking the page LSN, so any
+ * still safe to update the page info map. Any WAL record that clears
+ * the page info map bit does so before checking the page LSN, so any
* bits that need to be cleared will still be cleared.
*/
if (XLogReadBufferForRedoExtended(record, 0, RBM_ZERO_ON_ERROR, false,
- &vmbuffer) == BLK_NEEDS_REDO)
+ &pimbuffer) == BLK_NEEDS_REDO)
{
- Page vmpage = BufferGetPage(vmbuffer);
+ Page pimpage = BufferGetPage(pimbuffer);
Relation reln;
/* initialize the page if it was read as zeros */
- if (PageIsNew(vmpage))
- PageInit(vmpage, BLCKSZ, 0);
+ if (PageIsNew(pimpage))
+ PageInit(pimpage, BLCKSZ, 0);
/*
- * XLogReplayBufferExtended locked the buffer. But visibilitymap_set
+ * XLogReplayBufferExtended locked the buffer. But pageinfomap_set
* will handle locking itself.
*/
- LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
+ LockBuffer(pimbuffer, BUFFER_LOCK_UNLOCK);
reln = CreateFakeRelcacheEntry(rnode);
- visibilitymap_pin(reln, blkno, &vmbuffer);
+ pageinfomap_pin(reln, blkno, &pimbuffer);
/*
* Don't set the bit if replay has already passed this point.
@@ -7854,15 +7860,15 @@ heap_xlog_visible(XLogReaderState *record)
* we did for the heap page. If this results in a dropped bit, no
* real harm is done; and the next VACUUM will fix it.
*/
- if (lsn > PageGetLSN(vmpage))
- visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ if (lsn > PageGetLSN(pimpage))
+ pageinfomap_set(reln, blkno, InvalidBuffer, lsn, pimbuffer,
+ xlrec->cutoff_xid, xlrec->flags);
- ReleaseBuffer(vmbuffer);
+ ReleaseBuffer(pimbuffer);
FreeFakeRelcacheEntry(reln);
}
- else if (BufferIsValid(vmbuffer))
- UnlockReleaseBuffer(vmbuffer);
+ else if (BufferIsValid(pimbuffer))
+ UnlockReleaseBuffer(pimbuffer);
}
/*
@@ -7965,17 +7971,17 @@ heap_xlog_delete(XLogReaderState *record)
ItemPointerSetOffsetNumber(&target_tid, xlrec->offnum);
/*
- * The visibility map may need to be fixed even if the heap page is
+ * The page info map may need to be fixed even if the heap page is
* already up-to-date.
*/
if (xlrec->flags & XLH_DELETE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(target_node);
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
- visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
- ReleaseBuffer(vmbuffer);
+ pageinfomap_pin(reln, blkno, &pimbuffer);
+ pageinfomap_clear(reln, blkno, pimbuffer);
+ ReleaseBuffer(pimbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8043,17 +8049,17 @@ heap_xlog_insert(XLogReaderState *record)
ItemPointerSetOffsetNumber(&target_tid, xlrec->offnum);
/*
- * The visibility map may need to be fixed even if the heap page is
+ * The page info map may need to be fixed even if the heap page is
* already up-to-date.
*/
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(target_node);
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
- visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
- ReleaseBuffer(vmbuffer);
+ pageinfomap_pin(reln, blkno, &pimbuffer);
+ pageinfomap_clear(reln, blkno, pimbuffer);
+ ReleaseBuffer(pimbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8163,17 +8169,17 @@ heap_xlog_multi_insert(XLogReaderState *record)
XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
/*
- * The visibility map may need to be fixed even if the heap page is
+ * The page info map may need to be fixed even if the heap page is
* already up-to-date.
*/
if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(rnode);
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
- visibilitymap_pin(reln, blkno, &vmbuffer);
- visibilitymap_clear(reln, blkno, vmbuffer);
- ReleaseBuffer(vmbuffer);
+ pageinfomap_pin(reln, blkno, &pimbuffer);
+ pageinfomap_clear(reln, blkno, pimbuffer);
+ ReleaseBuffer(pimbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8318,17 +8324,17 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
ItemPointerSet(&newtid, newblk, xlrec->new_offnum);
/*
- * The visibility map may need to be fixed even if the heap page is
+ * The page info map may need to be fixed even if the heap page is
* already up-to-date.
*/
if (xlrec->flags & XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(rnode);
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
- visibilitymap_pin(reln, oldblk, &vmbuffer);
- visibilitymap_clear(reln, oldblk, vmbuffer);
- ReleaseBuffer(vmbuffer);
+ pageinfomap_pin(reln, oldblk, &pimbuffer);
+ pageinfomap_clear(reln, oldblk, pimbuffer);
+ ReleaseBuffer(pimbuffer);
FreeFakeRelcacheEntry(reln);
}
@@ -8402,17 +8408,17 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
newaction = XLogReadBufferForRedo(record, 0, &nbuffer);
/*
- * The visibility map may need to be fixed even if the heap page is
+ * The page info map may need to be fixed even if the heap page is
* already up-to-date.
*/
if (xlrec->flags & XLH_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(rnode);
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
- visibilitymap_pin(reln, newblk, &vmbuffer);
- visibilitymap_clear(reln, newblk, vmbuffer);
- ReleaseBuffer(vmbuffer);
+ pageinfomap_pin(reln, newblk, &pimbuffer);
+ pageinfomap_clear(reln, newblk, pimbuffer);
+ ReleaseBuffer(pimbuffer);
FreeFakeRelcacheEntry(reln);
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6db73bf..8f702be 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -18,7 +18,7 @@
#include "access/heapam.h"
#include "access/hio.h"
#include "access/htup_details.h"
-#include "access/visibilitymap.h"
+#include "access/pageinfomap.h"
#include "storage/bufmgr.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
@@ -112,16 +112,16 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock,
/*
* For each heap page which is all-visible, acquire a pin on the appropriate
- * visibility map page, if we haven't already got one.
+ * page info map page, if we haven't already got one.
*
* buffer2 may be InvalidBuffer, if only one buffer is involved. buffer1
* must not be InvalidBuffer. If both buffers are specified, buffer1 must
* be less than buffer2.
*/
static void
-GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
+GetPageInfoMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
BlockNumber block1, BlockNumber block2,
- Buffer *vmbuffer1, Buffer *vmbuffer2)
+ Buffer *pimbuffer1, Buffer *pimbuffer2)
{
bool need_to_pin_buffer1;
bool need_to_pin_buffer2;
@@ -133,10 +133,10 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
{
/* Figure out which pins we need but don't have. */
need_to_pin_buffer1 = PageIsAllVisible(BufferGetPage(buffer1))
- && !visibilitymap_pin_ok(block1, *vmbuffer1);
+ && !pageinfomap_pin_ok(block1, *pimbuffer1);
need_to_pin_buffer2 = buffer2 != InvalidBuffer
&& PageIsAllVisible(BufferGetPage(buffer2))
- && !visibilitymap_pin_ok(block2, *vmbuffer2);
+ && !pageinfomap_pin_ok(block2, *pimbuffer2);
if (!need_to_pin_buffer1 && !need_to_pin_buffer2)
return;
@@ -147,9 +147,9 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
/* Get pins. */
if (need_to_pin_buffer1)
- visibilitymap_pin(relation, block1, vmbuffer1);
+ pageinfomap_pin(relation, block1, pimbuffer1);
if (need_to_pin_buffer2)
- visibilitymap_pin(relation, block2, vmbuffer2);
+ pageinfomap_pin(relation, block2, pimbuffer2);
/* Relock buffers. */
LockBuffer(buffer1, BUFFER_LOCK_EXCLUSIVE);
@@ -192,7 +192,7 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
* happen if space is freed in that page after heap_update finds there's not
* enough there). In that case, the page will be pinned and locked only once.
*
- * For the vmbuffer and vmbuffer_other arguments, we avoid deadlock by
+ * For the pimbuffer and pimbuffer_other arguments, we avoid deadlock by
* locking them only after locking the corresponding heap page, and taking
* no further lwlocks while they are locked.
*
@@ -228,7 +228,7 @@ Buffer
RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other)
+ Buffer *pimbuffer, Buffer *pimbuffer_other)
{
bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
Buffer buffer = InvalidBuffer;
@@ -316,7 +316,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
* the possibility they are the same block.
*
* If the page-level all-visible flag is set, caller will need to
- * clear both that and the corresponding visibility map bit. However,
+ * clear both that and the corresponding page info map bit. However,
* by the time we return, we'll have x-locked the buffer, and we don't
* want to do any I/O while in that state. So we check the bit here
* before taking the lock, and pin the page if it appears necessary.
@@ -328,7 +328,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
/* easy case */
buffer = ReadBufferBI(relation, targetBlock, bistate);
if (PageIsAllVisible(BufferGetPage(buffer)))
- visibilitymap_pin(relation, targetBlock, vmbuffer);
+ pageinfomap_pin(relation, targetBlock, pimbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
else if (otherBlock == targetBlock)
@@ -336,7 +336,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
/* also easy case */
buffer = otherBuffer;
if (PageIsAllVisible(BufferGetPage(buffer)))
- visibilitymap_pin(relation, targetBlock, vmbuffer);
+ pageinfomap_pin(relation, targetBlock, pimbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
else if (otherBlock < targetBlock)
@@ -344,7 +344,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
/* lock other buffer first */
buffer = ReadBuffer(relation, targetBlock);
if (PageIsAllVisible(BufferGetPage(buffer)))
- visibilitymap_pin(relation, targetBlock, vmbuffer);
+ pageinfomap_pin(relation, targetBlock, pimbuffer);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
@@ -353,7 +353,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
/* lock target buffer first */
buffer = ReadBuffer(relation, targetBlock);
if (PageIsAllVisible(BufferGetPage(buffer)))
- visibilitymap_pin(relation, targetBlock, vmbuffer);
+ pageinfomap_pin(relation, targetBlock, pimbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
}
@@ -374,19 +374,19 @@ RelationGetBufferForTuple(Relation relation, Size len,
* caller passed us the right page anyway.
*
* Note also that it's possible that by the time we get the pin and
- * retake the buffer locks, the visibility map bit will have been
+ * retake the buffer locks, the page info map bit will have been
* cleared by some other backend anyway. In that case, we'll have
* done a bit of extra work for no gain, but there's no real harm
* done.
*/
if (otherBuffer == InvalidBuffer || buffer <= otherBuffer)
- GetVisibilityMapPins(relation, buffer, otherBuffer,
- targetBlock, otherBlock, vmbuffer,
- vmbuffer_other);
+ GetPageInfoMapPins(relation, buffer, otherBuffer,
+ targetBlock, otherBlock, pimbuffer,
+ pimbuffer_other);
else
- GetVisibilityMapPins(relation, otherBuffer, buffer,
- otherBlock, targetBlock, vmbuffer_other,
- vmbuffer);
+ GetPageInfoMapPins(relation, otherBuffer, buffer,
+ otherBlock, targetBlock, pimbuffer_other,
+ pimbuffer);
/*
* Now we can check to see if there's enough free space here. If so,
diff --git a/src/backend/access/heap/pageinfomap.c b/src/backend/access/heap/pageinfomap.c
new file mode 100644
index 0000000..6cea796
--- /dev/null
+++ b/src/backend/access/heap/pageinfomap.c
@@ -0,0 +1,676 @@
+/*-------------------------------------------------------------------------
+ *
+ * pageinfomap.c
+ * bitmap for tracking the all-visible and all-frozen status of heap pages
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/heap/pageinfomap.c
+ *
+ * INTERFACE ROUTINES
+ * pageinfomap_clear - clear the bits for a heap page in the page info map
+ * pageinfomap_pin - pin a map page for setting a bit
+ * pageinfomap_pin_ok - check whether correct map page is already pinned
+ * pageinfomap_set - set a bit in a previously pinned page
+ * pageinfomap_get_status - get status of bits
+ * pageinfomap_count - count number of bits set in page info map
+ * pageinfomap_truncate - truncate the page info map
+ *
+ * NOTES
+ *
+ * The page info map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if a whole-table vacuum (e.g. an anti-wraparound vacuum) is required.
+ * The all-frozen bit must be set only when the page is already all-visible.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
+ *
+ * Clearing a page info map bit is not separately WAL-logged. The callers
+ * must make sure that whenever a bit is cleared, the bit is cleared on WAL
+ * replay of the updating operation as well. The all-frozen bit must be
+ * cleared whenever the all-visible bit is cleared.
+ *
+ * When we *set* a page info map during VACUUM, we must write WAL. This may
+ * seem counterintuitive, since the bit is basically a hint: if it is clear,
+ * it may still be the case that every tuple on the page is all-visible or
+ * all-frozen; we just don't know that for certain. The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding page info map
+ * bit. If a crash occurs after the page info map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the page info map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
+ *
+ * VACUUM will normally skip pages for which the page info map bit is set;
+ * such pages can't contain any dead tuples and therefore don't need vacuuming.
+ * The all-frozen bit indicates that all tuples on the corresponding page have
+ * been completely frozen, so the page info map can also be used by an
+ * anti-wraparound vacuum to skip pages on which nothing needs freezing.
+ *
+ * LOCKING
+ *
+ * In heapam.c, whenever a page is modified so that not all tuples on the
+ * page are visible to everyone anymore, the corresponding bit in the
+ * page info map is cleared. In order to be crash-safe, we need to do this
+ * while still holding a lock on the heap page and in the same critical
+ * section that logs the page modification. However, we don't want to hold
+ * the buffer lock over any I/O that may be required to read in the page info
+ * map page. To avoid this, we examine the heap page before locking it;
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * page info map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the page info map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * page info map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
+ *
+ * To set a bit, you need to hold a lock on the heap page. That prevents
+ * the race condition where VACUUM sees that all tuples on the page are
+ * visible to everyone, but another backend modifies the page before VACUUM
+ * sets the bit in the page info map.
+ *
+ * When a bit is set, the LSN of the page info map page is updated to make
+ * sure that the page info map update doesn't get written to disk before the
+ * WAL record of the changes that made it possible to set the bit is flushed.
+ * But when a bit is cleared, we don't have to do that because it's always
+ * safe to clear a bit in the map from correctness point of view.
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam_xlog.h"
+#include "access/pageinfomap.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/lmgr.h"
+#include "storage/smgr.h"
+#include "utils/inval.h"
+
+
+/*#define TRACE_PAGEINFOMAP */
+
+/*
+ * Size of the bitmap on each page info map page, in bytes. There's no
+ * extra headers, so the whole page minus the standard page header is
+ * used for the bitmap.
+ */
+#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
+
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, other for all-frozen.
+*/
+#define BITS_PER_HEAPBLOCK 2
+
+/* Number of heap blocks we can represent in one byte. */
+#define HEAPBLOCKS_PER_BYTE 4
+
+/* Number of heap blocks we can represent in one page info map page. */
+#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
+
+/* Mapping from heap block number to the right bit in the page info map */
+#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
+#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
+#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
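+
+/*
+ * Worked example of the mapping macros above (an editorial sketch, assuming
+ * the default 8 kB block size and a 24-byte page header, so MAPSIZE =
+ * 8192 - 24 = 8168 and HEAPBLOCKS_PER_PAGE = 32672):
+ *
+ *   heapBlk = 5      -> mapBlock = 0, mapByte = 1,   mapBit = 2
+ *   heapBlk = 100000 -> mapBlock = 3, mapByte = 496, mapBit = 0
+ *
+ * The two bits for a heap block live at positions mapBit (all-visible) and
+ * mapBit + 1 (all-frozen) within map[mapByte], as implied by the counting
+ * tables below.
+ */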
+
+/* tables for fast counting of set bits for all-visible and all-frozen */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
+};
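+
+/*
+ * Illustration (editorial note): each map byte covers four heap blocks, with
+ * the all-visible bits in the even bit positions and the all-frozen bits in
+ * the odd positions, which is what the two tables above encode. For example:
+ *
+ *   byte 0x55 (01010101) -> number_of_ones_for_visible[0x55] = 4,
+ *                           number_of_ones_for_frozen[0x55]  = 0
+ *   byte 0xFF (11111111) -> number_of_ones_for_visible[0xFF] = 4,
+ *                           number_of_ones_for_frozen[0xFF]  = 4
+ */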
+
+/* prototypes for internal routines */
+static Buffer pim_readbuf(Relation rel, BlockNumber blkno, bool extend);
+static void pim_extend(Relation rel, BlockNumber npimblocks);
+
+
+/*
+ * pageinfomap_clear - clear the page info map bits for a heap page
+ *
+ * You must pass a buffer containing the correct map page to this function.
+ * Call pageinfomap_pin first to pin the right one. This function doesn't do
+ * any I/O.
+ */
+void
+pageinfomap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
+{
+ BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+ int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+ int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+ uint8 mask = PAGEINFOMAP_ALL_FLAGS << mapBit;
+ char *map;
+
+#ifdef TRACE_PAGEINFOMAP
+ elog(DEBUG1, "pim_clear %s block %d", RelationGetRelationName(rel), heapBlk);
+#endif
+
+ if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
+ elog(ERROR, "wrong buffer passed to pageinfomap_clear");
+
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ map = PageGetContents(BufferGetPage(buf));
+
+ if (map[mapByte] & mask)
+ {
+ map[mapByte] &= ~mask;
+
+ MarkBufferDirty(buf);
+ }
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+}
+
+/*
+ * pageinfomap_pin - pin a map page for setting a bit
+ *
+ * Setting a bit in the page info map is a two-phase operation. First, call
+ * pageinfomap_pin, to pin the page info map page containing the bit for
+ * the heap page. Because that can require I/O to read the map page, you
+ * shouldn't hold a lock on the heap page while doing that. Then, call
+ * pageinfomap_set to actually set the bit.
+ *
+ * On entry, *buf should be InvalidBuffer or a valid buffer returned by
+ * an earlier call to pageinfomap_pin or pageinfomap_get_status on the same
+ * relation. On return, *buf is a valid buffer with the map page containing
+ * the bit for heapBlk.
+ *
+ * If the page doesn't exist in the map file yet, it is extended.
+ */
+void
+pageinfomap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
+{
+ BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+
+ /* Reuse the old pinned buffer if possible */
+ if (BufferIsValid(*buf))
+ {
+ if (BufferGetBlockNumber(*buf) == mapBlock)
+ return;
+
+ ReleaseBuffer(*buf);
+ }
+ *buf = pim_readbuf(rel, mapBlock, true);
+}
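+
+/*
+ * Typical calling pattern for the pin/set pair (a sketch only; the real
+ * callers, e.g. in the vacuum code, are not part of this excerpt):
+ *
+ *     Buffer pimbuffer = InvalidBuffer;
+ *
+ *     (pin first, because this may do I/O and must not happen while the
+ *      heap buffer is locked)
+ *     pageinfomap_pin(rel, blkno, &pimbuffer);
+ *
+ *     LockBuffer(heapbuf, BUFFER_LOCK_EXCLUSIVE);
+ *     ... set PD_ALL_VISIBLE and/or PD_ALL_FROZEN on the heap page and
+ *         MarkBufferDirty(heapbuf) ...
+ *     pageinfomap_set(rel, blkno, heapbuf, InvalidXLogRecPtr, pimbuffer,
+ *                     cutoff_xid, PAGEINFOMAP_ALL_VISIBLE);
+ *     LockBuffer(heapbuf, BUFFER_LOCK_UNLOCK);
+ *     ReleaseBuffer(pimbuffer);
+ */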
+
+/*
+ * pageinfomap_pin_ok - do we already have the correct page pinned?
+ *
+ * On entry, buf should be InvalidBuffer or a valid buffer returned by
+ * an earlier call to pageinfomap_pin or pageinfomap_get_status on the same
+ * relation. The return value indicates whether the buffer covers the
+ * given heapBlk.
+ */
+bool
+pageinfomap_pin_ok(BlockNumber heapBlk, Buffer buf)
+{
+ BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+
+ return BufferIsValid(buf) && BufferGetBlockNumber(buf) == mapBlock;
+}
+
+/*
+ * pageinfomap_set - set bit(s) on a previously pinned page
+ *
+ * recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
+ * or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
+ * one provided; in normal running, we generate a new XLOG record and set the
+ * page LSN to that value. cutoff_xid is the largest xmin on the page being
+ * marked all-visible; it is needed for Hot Standby, and can be
+ * InvalidTransactionId if the page contains no tuples.
+ *
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit(s) before calling this function. Except in recovery, the caller should
+ * also pass the heap buffer; 'flags' indicates which bits are to be set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
+ *
+ * You must pass a buffer containing the correct map page to this function.
+ * Call pageinfomap_pin first to pin the right one. This function doesn't do
+ * any I/O.
+ */
+void
+pageinfomap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
+ XLogRecPtr recptr, Buffer pimBuf, TransactionId cutoff_xid,
+ uint8 flags)
+{
+ BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+ uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+ uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+ Page page;
+ char *map;
+
+#ifdef TRACE_PAGEINFOMAP
+ elog(DEBUG1, "pim_set %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
+#endif
+
+ Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
+ Assert(InRecovery || BufferIsValid(heapBuf));
+ Assert(flags & PAGEINFOMAP_ALL_FLAGS);
+
+ /* Check that we have the right heap page pinned, if present */
+ if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
+ elog(ERROR, "wrong heap buffer passed to pageinfomap_set");
+
+ /* Check that we have the right PIM page pinned */
+ if (!BufferIsValid(pimBuf) || BufferGetBlockNumber(pimBuf) != mapBlock)
+ elog(ERROR, "wrong PIM buffer passed to pageinfomap_set");
+
+ page = BufferGetPage(pimBuf);
+ map = PageGetContents(page);
+ LockBuffer(pimBuf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (flags != ((map[mapByte] >> mapBit) & flags))
+ {
+ START_CRIT_SECTION();
+
+ map[mapByte] |= (flags << mapBit);
+ MarkBufferDirty(pimBuf);
+
+ if (RelationNeedsWAL(rel))
+ {
+ if (XLogRecPtrIsInvalid(recptr))
+ {
+ Assert(!InRecovery);
+ recptr = log_heap_visible(rel->rd_node, heapBuf, pimBuf,
+ cutoff_xid, flags);
+
+ /*
+ * If data checksums are enabled (or wal_log_hints=on), we
+ * need to protect the heap page from being torn.
+ */
+ if (XLogHintBitIsNeeded())
+ {
+ Page heapPage = BufferGetPage(heapBuf);
+
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(((flags & PAGEINFOMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
+ ((flags & PAGEINFOMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
+ PageSetLSN(heapPage, recptr);
+ }
+ }
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+ }
+
+ LockBuffer(pimBuf, BUFFER_LOCK_UNLOCK);
+}
+
+/*
+ * pageinfomap_get_status - get status of bits
+ *
+ * Are all tuples on heapBlk visible to all transactions, or all frozen,
+ * according to the page info map?
+ *
+ * On entry, *buf should be InvalidBuffer or a valid buffer returned by an
+ * earlier call to pageinfomap_pin or pageinfomap_get_status on the same
+ * relation. On return, *buf is a valid buffer with the map page containing
+ * the bit for heapBlk, or InvalidBuffer. The caller is responsible for
+ * releasing *buf after it's done testing and setting bits. The return value
+ * contains the PAGEINFOMAP_ALL_VISIBLE and/or PAGEINFOMAP_ALL_FROZEN bits
+ * that are currently set for heapBlk.
+ *
+ * NOTE: This function is typically called without a lock on the heap page,
+ * so somebody else could change the bit just after we look at it. In fact,
+ * since we don't lock the page info map page either, it's even possible that
+ * someone else could have changed the bit just before we look at it, but yet
+ * we might see the old value. It is the caller's responsibility to deal with
+ * all concurrency issues!
+ */
+uint8
+pageinfomap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
+{
+ BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+ uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+ uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+ char *map;
+
+#ifdef TRACE_PAGEINFOMAP
+ elog(DEBUG1, "pim_get_status %s, block %d", RelationGetRelationName(rel), heapBlk);
+#endif
+
+ /* Reuse the old pinned buffer if possible */
+ if (BufferIsValid(*buf))
+ {
+ if (BufferGetBlockNumber(*buf) != mapBlock)
+ {
+ ReleaseBuffer(*buf);
+ *buf = InvalidBuffer;
+ }
+ }
+
+ if (!BufferIsValid(*buf))
+ {
+ *buf = pim_readbuf(rel, mapBlock, false);
+ if (!BufferIsValid(*buf))
+ return 0;
+ }
+
+ map = PageGetContents(BufferGetPage(*buf));
+
+ /*
+ * Reading the two bits is atomic, since they live within a single byte.
+ * There could be memory-ordering effects here, but for performance reasons
+ * we make it the caller's job to worry about that.
+ */
+ return ((map[mapByte] >> mapBit) & PAGEINFOMAP_ALL_FLAGS);
+}
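+
+/*
+ * Example use (sketch of a hypothetical caller deciding whether a page can
+ * be skipped because nothing on it needs freezing):
+ *
+ *     Buffer pimbuffer = InvalidBuffer;
+ *     uint8  status = pageinfomap_get_status(rel, blkno, &pimbuffer);
+ *
+ *     if (status & PAGEINFOMAP_ALL_FROZEN)
+ *         ... skip the page; all of its tuples are already frozen ...
+ *
+ *     if (BufferIsValid(pimbuffer))
+ *         ReleaseBuffer(pimbuffer);
+ */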
+
+/*
+ * pageinfomap_count - count number of bits set in page info map
+ *
+ * Note: we ignore the possibility of race conditions when the table is being
+ * extended concurrently with the call. New pages added to the table aren't
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * The return value is the number of pages marked all-visible; if all_frozen
+ * is not NULL, *all_frozen is additionally set to the number of pages marked
+ * all-frozen.
+ */
+BlockNumber
+pageinfomap_count(Relation rel, BlockNumber *all_frozen)
+{
+ BlockNumber mapBlock;
+ BlockNumber all_visible = 0;
+
+ if (all_frozen)
+ *all_frozen = 0;
+
+ for (mapBlock = 0;; mapBlock++)
+ {
+ Buffer mapBuffer;
+ unsigned char *map;
+ int i;
+
+ /*
+ * Read till we fall off the end of the map. We assume that any extra
+ * bytes in the last page are zeroed, so we don't bother excluding
+ * them from the count.
+ */
+ mapBuffer = pim_readbuf(rel, mapBlock, false);
+ if (!BufferIsValid(mapBuffer))
+ break;
+
+ /*
+ * We choose not to lock the page, since the result is going to be
+ * immediately stale anyway if anyone is concurrently setting or
+ * clearing bits, and we only really need an approximate value.
+ */
+ map = (unsigned char *) PageGetContents(BufferGetPage(mapBuffer));
+
+ for (i = 0; i < MAPSIZE; i++)
+ {
+ all_visible += number_of_ones_for_visible[map[i]];
+ if (all_frozen)
+ *all_frozen += number_of_ones_for_frozen[map[i]];
+ }
+
+ ReleaseBuffer(mapBuffer);
+ }
+
+ return all_visible;
+}
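+
+/*
+ * Example use (sketch only): counting both kinds of pages, or passing NULL
+ * when only the all-visible count is wanted.
+ *
+ *     BlockNumber nfrozen;
+ *     BlockNumber nvisible = pageinfomap_count(rel, &nfrozen);
+ *
+ *     BlockNumber nvisible_only = pageinfomap_count(rel, NULL);
+ */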
+
+/*
+ * pageinfomap_truncate - truncate the page info map
+ *
+ * The caller must hold AccessExclusiveLock on the relation, to ensure that
+ * other backends receive the smgr invalidation event that this function sends
+ * before they access the PIM again.
+ *
+ * nheapblocks is the new size of the heap.
+ */
+void
+pageinfomap_truncate(Relation rel, BlockNumber nheapblocks)
+{
+ BlockNumber newnblocks;
+
+ /* last remaining block, byte, and bit */
+ BlockNumber truncBlock = HEAPBLK_TO_MAPBLOCK(nheapblocks);
+ uint32 truncByte = HEAPBLK_TO_MAPBYTE(nheapblocks);
+ uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
+
+#ifdef TRACE_PAGEINFOMAP
+ elog(DEBUG1, "pim_truncate %s block %d", RelationGetRelationName(rel), nheapblocks);
+#endif
+
+ RelationOpenSmgr(rel);
+
+ /*
+ * If no page info map has been created yet for this relation, there's
+ * nothing to truncate.
+ */
+ if (!smgrexists(rel->rd_smgr, PAGEINFOMAP_FORKNUM))
+ return;
+
+ /*
+ * Unless the new size is exactly at a page info map page boundary, the
+ * tail bits in the last remaining map page, representing truncated heap
+ * blocks, need to be cleared. This is not only tidy, but also necessary
+ * because we don't get a chance to clear the bits if the heap is extended
+ * again.
+ */
+ if (truncByte != 0 || truncBit != 0)
+ {
+ Buffer mapBuffer;
+ Page page;
+ char *map;
+
+ newnblocks = truncBlock + 1;
+
+ mapBuffer = pim_readbuf(rel, truncBlock, false);
+ if (!BufferIsValid(mapBuffer))
+ {
+ /* nothing to do, the file was already smaller */
+ return;
+ }
+
+ page = BufferGetPage(mapBuffer);
+ map = PageGetContents(page);
+
+ LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+ /* Clear out the unwanted bytes. */
+ MemSet(&map[truncByte + 1], 0, MAPSIZE - (truncByte + 1));
+
+ /*----
+ * Mask out the unwanted bits of the last remaining byte.
+ *
+ * ((1 << 0) - 1) = 00000000
+ * ((1 << 1) - 1) = 00000001
+ * ...
+ * ((1 << 6) - 1) = 00111111
+ * ((1 << 7) - 1) = 01111111
+ *----
+ */
+ map[truncByte] &= (1 << truncBit) - 1;
+
+ MarkBufferDirty(mapBuffer);
+ UnlockReleaseBuffer(mapBuffer);
+ }
+ else
+ newnblocks = truncBlock;
+
+ if (smgrnblocks(rel->rd_smgr, PAGEINFOMAP_FORKNUM) <= newnblocks)
+ {
+ /* nothing to do, the file was already smaller than requested size */
+ return;
+ }
+
+ /* Truncate the unused PIM pages, and send smgr inval message */
+ smgrtruncate(rel->rd_smgr, PAGEINFOMAP_FORKNUM, newnblocks);
+
+ /*
+ * We might as well update the local smgr_pim_nblocks setting. smgrtruncate
+ * sent an smgr cache inval message, which will cause other backends to
+ * invalidate their copy of smgr_pim_nblocks, and this one too at the next
+ * command boundary. But this ensures it isn't outright wrong until then.
+ */
+ if (rel->rd_smgr)
+ rel->rd_smgr->smgr_pim_nblocks = newnblocks;
+}
+
+/*
+ * Read a page info map page.
+ *
+ * If the page doesn't exist, InvalidBuffer is returned, or if 'extend' is
+ * true, the page info map file is extended.
+ */
+static Buffer
+pim_readbuf(Relation rel, BlockNumber blkno, bool extend)
+{
+ Buffer buf;
+
+ /*
+ * We might not have opened the relation at the smgr level yet, or we
+ * might have been forced to close it by a sinval message. The code below
+ * won't necessarily notice relation extension immediately when extend =
+ * false, so we rely on sinval messages to ensure that our ideas about the
+ * size of the map aren't too far out of date.
+ */
+ RelationOpenSmgr(rel);
+
+ /*
+ * If we haven't cached the size of the page info map fork yet, check it
+ * first.
+ */
+ if (rel->rd_smgr->smgr_pim_nblocks == InvalidBlockNumber)
+ {
+ if (smgrexists(rel->rd_smgr, PAGEINFOMAP_FORKNUM))
+ rel->rd_smgr->smgr_pim_nblocks = smgrnblocks(rel->rd_smgr,
+ PAGEINFOMAP_FORKNUM);
+ else
+ rel->rd_smgr->smgr_pim_nblocks = 0;
+ }
+
+ /* Handle requests beyond EOF */
+ if (blkno >= rel->rd_smgr->smgr_pim_nblocks)
+ {
+ if (extend)
+ pim_extend(rel, blkno + 1);
+ else
+ return InvalidBuffer;
+ }
+
+ /*
+ * Use ZERO_ON_ERROR mode, and initialize the page if necessary. It's
+ * always safe to clear bits, so it's better to clear corrupt pages than
+ * error out.
+ */
+ buf = ReadBufferExtended(rel, PAGEINFOMAP_FORKNUM, blkno,
+ RBM_ZERO_ON_ERROR, NULL);
+ if (PageIsNew(BufferGetPage(buf)))
+ PageInit(BufferGetPage(buf), BLCKSZ, 0);
+ return buf;
+}
+
+/*
+ * Ensure that the page info map fork is at least pim_nblocks long, extending
+ * it if necessary with zeroed pages.
+ */
+static void
+pim_extend(Relation rel, BlockNumber pim_nblocks)
+{
+ BlockNumber pim_nblocks_now;
+ Page pg;
+
+ pg = (Page) palloc(BLCKSZ);
+ PageInit(pg, BLCKSZ, 0);
+
+ /*
+ * We use the relation extension lock to lock out other backends trying to
+ * extend the page info map at the same time. It also locks out extension
+ * of the main fork, unnecessarily, but extending the page info map
+ * happens seldom enough that it doesn't seem worthwhile to have a
+ * separate lock tag type for it.
+ *
+ * Note that another backend might have extended or created the relation
+ * by the time we get the lock.
+ */
+ LockRelationForExtension(rel, ExclusiveLock);
+
+ /* Might have to re-open if a cache flush happened */
+ RelationOpenSmgr(rel);
+
+ /*
+ * Create the file first if it doesn't exist. If smgr_pim_nblocks is
+ * positive then it must exist, no need for an smgrexists call.
+ */
+ if ((rel->rd_smgr->smgr_pim_nblocks == 0 ||
+ rel->rd_smgr->smgr_pim_nblocks == InvalidBlockNumber) &&
+ !smgrexists(rel->rd_smgr, PAGEINFOMAP_FORKNUM))
+ smgrcreate(rel->rd_smgr, PAGEINFOMAP_FORKNUM, false);
+
+ pim_nblocks_now = smgrnblocks(rel->rd_smgr, PAGEINFOMAP_FORKNUM);
+
+ /* Now extend the file */
+ while (pim_nblocks_now < pim_nblocks)
+ {
+ PageSetChecksumInplace(pg, pim_nblocks_now);
+
+ smgrextend(rel->rd_smgr, PAGEINFOMAP_FORKNUM, pim_nblocks_now,
+ (char *) pg, false);
+ pim_nblocks_now++;
+ }
+
+ /*
+ * Send a shared-inval message to force other backends to close any smgr
+ * references they may have for this rel, which we are about to change.
+ * This is a useful optimization because it means that backends don't have
+ * to keep checking for creation or extension of the file, which happens
+ * infrequently.
+ */
+ CacheInvalidateSmgr(rel->rd_smgr->smgr_rnode);
+
+ /* Update local cache with the up-to-date size */
+ rel->rd_smgr->smgr_pim_nblocks = pim_nblocks_now;
+
+ UnlockRelationForExtension(rel, ExclusiveLock);
+
+ pfree(pg);
+}
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
deleted file mode 100644
index 7c38772..0000000
--- a/src/backend/access/heap/visibilitymap.c
+++ /dev/null
@@ -1,635 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * visibilitymap.c
- * bitmap for tracking visibility of heap tuples
- *
- * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
- *
- * IDENTIFICATION
- * src/backend/access/heap/visibilitymap.c
- *
- * INTERFACE ROUTINES
- * visibilitymap_clear - clear a bit in the visibility map
- * visibilitymap_pin - pin a map page for setting a bit
- * visibilitymap_pin_ok - check whether correct map page is already pinned
- * visibilitymap_set - set a bit in a previously pinned page
- * visibilitymap_test - test if a bit is set
- * visibilitymap_count - count number of bits set in visibility map
- * visibilitymap_truncate - truncate the visibility map
- *
- * NOTES
- *
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
- *
- * Clearing a visibility map bit is not separately WAL-logged. The callers
- * must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
- *
- * When we *set* a visibility map during VACUUM, we must write WAL. This may
- * seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
- *
- * VACUUM will normally skip pages for which the visibility map bit is set;
- * such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
- *
- * LOCKING
- *
- * In heapam.c, whenever a page is modified so that not all tuples on the
- * page are visible to everyone anymore, the corresponding bit in the
- * visibility map is cleared. In order to be crash-safe, we need to do this
- * while still holding a lock on the heap page and in the same critical
- * section that logs the page modification. However, we don't want to hold
- * the buffer lock over any I/O that may be required to read in the visibility
- * map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
- *
- * To set a bit, you need to hold a lock on the heap page. That prevents
- * the race condition where VACUUM sees that all tuples on the page are
- * visible to everyone, but another backend modifies the page before VACUUM
- * sets the bit in the visibility map.
- *
- * When a bit is set, the LSN of the visibility map page is updated to make
- * sure that the visibility map update doesn't get written to disk before the
- * WAL record of the changes that made it possible to set the bit is flushed.
- * But when a bit is cleared, we don't have to do that because it's always
- * safe to clear a bit in the map from correctness point of view.
- *
- *-------------------------------------------------------------------------
- */
-#include "postgres.h"
-
-#include "access/heapam_xlog.h"
-#include "access/visibilitymap.h"
-#include "access/xlog.h"
-#include "miscadmin.h"
-#include "storage/bufmgr.h"
-#include "storage/lmgr.h"
-#include "storage/smgr.h"
-#include "utils/inval.h"
-
-
-/*#define TRACE_VISIBILITYMAP */
-
-/*
- * Size of the bitmap on each visibility map page, in bytes. There's no
- * extra headers, so the whole page minus the standard page header is
- * used for the bitmap.
- */
-#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
-
-/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
-
-/* Number of heap blocks we can represent in one visibility map page. */
-#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
-
-/* Mapping from heap block number to the right bit in the visibility map */
-#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
-#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
-#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
-};
-
-/* prototypes for internal routines */
-static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
-static void vm_extend(Relation rel, BlockNumber nvmblocks);
-
-
-/*
- * visibilitymap_clear - clear a bit in visibility map
- *
- * You must pass a buffer containing the correct map page to this function.
- * Call visibilitymap_pin first to pin the right one. This function doesn't do
- * any I/O.
- */
-void
-visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
-{
- BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
- int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
- int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
- char *map;
-
-#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
-#endif
-
- if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
- elog(ERROR, "wrong buffer passed to visibilitymap_clear");
-
- LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- map = PageGetContents(BufferGetPage(buf));
-
- if (map[mapByte] & mask)
- {
- map[mapByte] &= ~mask;
-
- MarkBufferDirty(buf);
- }
-
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-}
-
-/*
- * visibilitymap_pin - pin a map page for setting a bit
- *
- * Setting a bit in the visibility map is a two-phase operation. First, call
- * visibilitymap_pin, to pin the visibility map page containing the bit for
- * the heap page. Because that can require I/O to read the map page, you
- * shouldn't hold a lock on the heap page while doing that. Then, call
- * visibilitymap_set to actually set the bit.
- *
- * On entry, *buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
- * relation. On return, *buf is a valid buffer with the map page containing
- * the bit for heapBlk.
- *
- * If the page doesn't exist in the map file yet, it is extended.
- */
-void
-visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
-{
- BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
-
- /* Reuse the old pinned buffer if possible */
- if (BufferIsValid(*buf))
- {
- if (BufferGetBlockNumber(*buf) == mapBlock)
- return;
-
- ReleaseBuffer(*buf);
- }
- *buf = vm_readbuf(rel, mapBlock, true);
-}
-
-/*
- * visibilitymap_pin_ok - do we already have the correct page pinned?
- *
- * On entry, buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
- * relation. The return value indicates whether the buffer covers the
- * given heapBlk.
- */
-bool
-visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
-{
- BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
-
- return BufferIsValid(buf) && BufferGetBlockNumber(buf) == mapBlock;
-}
-
-/*
- * visibilitymap_set - set a bit on a previously pinned page
- *
- * recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
- * or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
- * one provided; in normal running, we generate a new XLOG record and set the
- * page LSN to that value. cutoff_xid is the largest xmin on the page being
- * marked all-visible; it is needed for Hot Standby, and can be
- * InvalidTransactionId if the page contains no tuples.
- *
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
- *
- * You must pass a buffer containing the correct map page to this function.
- * Call visibilitymap_pin first to pin the right one. This function doesn't do
- * any I/O.
- */
-void
-visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
-{
- BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
- uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
- uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- Page page;
- char *map;
-
-#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
-#endif
-
- Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
- Assert(InRecovery || BufferIsValid(heapBuf));
-
- /* Check that we have the right heap page pinned, if present */
- if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
- elog(ERROR, "wrong heap buffer passed to visibilitymap_set");
-
- /* Check that we have the right VM page pinned */
- if (!BufferIsValid(vmBuf) || BufferGetBlockNumber(vmBuf) != mapBlock)
- elog(ERROR, "wrong VM buffer passed to visibilitymap_set");
-
- page = BufferGetPage(vmBuf);
- map = PageGetContents(page);
- LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
-
- if (!(map[mapByte] & (1 << mapBit)))
- {
- START_CRIT_SECTION();
-
- map[mapByte] |= (1 << mapBit);
- MarkBufferDirty(vmBuf);
-
- if (RelationNeedsWAL(rel))
- {
- if (XLogRecPtrIsInvalid(recptr))
- {
- Assert(!InRecovery);
- recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
-
- /*
- * If data checksums are enabled (or wal_log_hints=on), we
- * need to protect the heap page from being torn.
- */
- if (XLogHintBitIsNeeded())
- {
- Page heapPage = BufferGetPage(heapBuf);
-
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
- PageSetLSN(heapPage, recptr);
- }
- }
- PageSetLSN(page, recptr);
- }
-
- END_CRIT_SECTION();
- }
-
- LockBuffer(vmBuf, BUFFER_LOCK_UNLOCK);
-}
-
-/*
- * visibilitymap_test - test if a bit is set
- *
- * Are all tuples on heapBlk visible to all, according to the visibility map?
- *
- * On entry, *buf should be InvalidBuffer or a valid buffer returned by an
- * earlier call to visibilitymap_pin or visibilitymap_test on the same
- * relation. On return, *buf is a valid buffer with the map page containing
- * the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
- *
- * NOTE: This function is typically called without a lock on the heap page,
- * so somebody else could change the bit just after we look at it. In fact,
- * since we don't lock the visibility map page either, it's even possible that
- * someone else could have changed the bit just before we look at it, but yet
- * we might see the old value. It is the caller's responsibility to deal with
- * all concurrency issues!
- */
-bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
-{
- BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
- uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
- uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- bool result;
- char *map;
-
-#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
-#endif
-
- /* Reuse the old pinned buffer if possible */
- if (BufferIsValid(*buf))
- {
- if (BufferGetBlockNumber(*buf) != mapBlock)
- {
- ReleaseBuffer(*buf);
- *buf = InvalidBuffer;
- }
- }
-
- if (!BufferIsValid(*buf))
- {
- *buf = vm_readbuf(rel, mapBlock, false);
- if (!BufferIsValid(*buf))
- return false;
- }
-
- map = PageGetContents(BufferGetPage(*buf));
-
- /*
- * A single-bit read is atomic. There could be memory-ordering effects
- * here, but for performance reasons we make it the caller's job to worry
- * about that.
- */
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
-
- return result;
-}
-
-/*
- * visibilitymap_count - count number of bits set in visibility map
- *
- * Note: we ignore the possibility of race conditions when the table is being
- * extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
- */
-BlockNumber
-visibilitymap_count(Relation rel)
-{
- BlockNumber result = 0;
- BlockNumber mapBlock;
-
- for (mapBlock = 0;; mapBlock++)
- {
- Buffer mapBuffer;
- unsigned char *map;
- int i;
-
- /*
- * Read till we fall off the end of the map. We assume that any extra
- * bytes in the last page are zeroed, so we don't bother excluding
- * them from the count.
- */
- mapBuffer = vm_readbuf(rel, mapBlock, false);
- if (!BufferIsValid(mapBuffer))
- break;
-
- /*
- * We choose not to lock the page, since the result is going to be
- * immediately stale anyway if anyone is concurrently setting or
- * clearing bits, and we only really need an approximate value.
- */
- map = (unsigned char *) PageGetContents(BufferGetPage(mapBuffer));
-
- for (i = 0; i < MAPSIZE; i++)
- {
- result += number_of_ones[map[i]];
- }
-
- ReleaseBuffer(mapBuffer);
- }
-
- return result;
-}
-
-/*
- * visibilitymap_truncate - truncate the visibility map
- *
- * The caller must hold AccessExclusiveLock on the relation, to ensure that
- * other backends receive the smgr invalidation event that this function sends
- * before they access the VM again.
- *
- * nheapblocks is the new size of the heap.
- */
-void
-visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
-{
- BlockNumber newnblocks;
-
- /* last remaining block, byte, and bit */
- BlockNumber truncBlock = HEAPBLK_TO_MAPBLOCK(nheapblocks);
- uint32 truncByte = HEAPBLK_TO_MAPBYTE(nheapblocks);
- uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
-
-#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
-#endif
-
- RelationOpenSmgr(rel);
-
- /*
- * If no visibility map has been created yet for this relation, there's
- * nothing to truncate.
- */
- if (!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
- return;
-
- /*
- * Unless the new size is exactly at a visibility map page boundary, the
- * tail bits in the last remaining map page, representing truncated heap
- * blocks, need to be cleared. This is not only tidy, but also necessary
- * because we don't get a chance to clear the bits if the heap is extended
- * again.
- */
- if (truncByte != 0 || truncBit != 0)
- {
- Buffer mapBuffer;
- Page page;
- char *map;
-
- newnblocks = truncBlock + 1;
-
- mapBuffer = vm_readbuf(rel, truncBlock, false);
- if (!BufferIsValid(mapBuffer))
- {
- /* nothing to do, the file was already smaller */
- return;
- }
-
- page = BufferGetPage(mapBuffer);
- map = PageGetContents(page);
-
- LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
-
- /* Clear out the unwanted bytes. */
- MemSet(&map[truncByte + 1], 0, MAPSIZE - (truncByte + 1));
-
- /*----
- * Mask out the unwanted bits of the last remaining byte.
- *
- * ((1 << 0) - 1) = 00000000
- * ((1 << 1) - 1) = 00000001
- * ...
- * ((1 << 6) - 1) = 00111111
- * ((1 << 7) - 1) = 01111111
- *----
- */
- map[truncByte] &= (1 << truncBit) - 1;
-
- MarkBufferDirty(mapBuffer);
- UnlockReleaseBuffer(mapBuffer);
- }
- else
- newnblocks = truncBlock;
-
- if (smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM) <= newnblocks)
- {
- /* nothing to do, the file was already smaller than requested size */
- return;
- }
-
- /* Truncate the unused VM pages, and send smgr inval message */
- smgrtruncate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, newnblocks);
-
- /*
- * We might as well update the local smgr_vm_nblocks setting. smgrtruncate
- * sent an smgr cache inval message, which will cause other backends to
- * invalidate their copy of smgr_vm_nblocks, and this one too at the next
- * command boundary. But this ensures it isn't outright wrong until then.
- */
- if (rel->rd_smgr)
- rel->rd_smgr->smgr_vm_nblocks = newnblocks;
-}
-
-/*
- * Read a visibility map page.
- *
- * If the page doesn't exist, InvalidBuffer is returned, or if 'extend' is
- * true, the visibility map file is extended.
- */
-static Buffer
-vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
-{
- Buffer buf;
-
- /*
- * We might not have opened the relation at the smgr level yet, or we
- * might have been forced to close it by a sinval message. The code below
- * won't necessarily notice relation extension immediately when extend =
- * false, so we rely on sinval messages to ensure that our ideas about the
- * size of the map aren't too far out of date.
- */
- RelationOpenSmgr(rel);
-
- /*
- * If we haven't cached the size of the visibility map fork yet, check it
- * first.
- */
- if (rel->rd_smgr->smgr_vm_nblocks == InvalidBlockNumber)
- {
- if (smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
- rel->rd_smgr->smgr_vm_nblocks = smgrnblocks(rel->rd_smgr,
- VISIBILITYMAP_FORKNUM);
- else
- rel->rd_smgr->smgr_vm_nblocks = 0;
- }
-
- /* Handle requests beyond EOF */
- if (blkno >= rel->rd_smgr->smgr_vm_nblocks)
- {
- if (extend)
- vm_extend(rel, blkno + 1);
- else
- return InvalidBuffer;
- }
-
- /*
- * Use ZERO_ON_ERROR mode, and initialize the page if necessary. It's
- * always safe to clear bits, so it's better to clear corrupt pages than
- * error out.
- */
- buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno,
- RBM_ZERO_ON_ERROR, NULL);
- if (PageIsNew(BufferGetPage(buf)))
- PageInit(BufferGetPage(buf), BLCKSZ, 0);
- return buf;
-}
-
-/*
- * Ensure that the visibility map fork is at least vm_nblocks long, extending
- * it if necessary with zeroed pages.
- */
-static void
-vm_extend(Relation rel, BlockNumber vm_nblocks)
-{
- BlockNumber vm_nblocks_now;
- Page pg;
-
- pg = (Page) palloc(BLCKSZ);
- PageInit(pg, BLCKSZ, 0);
-
- /*
- * We use the relation extension lock to lock out other backends trying to
- * extend the visibility map at the same time. It also locks out extension
- * of the main fork, unnecessarily, but extending the visibility map
- * happens seldom enough that it doesn't seem worthwhile to have a
- * separate lock tag type for it.
- *
- * Note that another backend might have extended or created the relation
- * by the time we get the lock.
- */
- LockRelationForExtension(rel, ExclusiveLock);
-
- /* Might have to re-open if a cache flush happened */
- RelationOpenSmgr(rel);
-
- /*
- * Create the file first if it doesn't exist. If smgr_vm_nblocks is
- * positive then it must exist, no need for an smgrexists call.
- */
- if ((rel->rd_smgr->smgr_vm_nblocks == 0 ||
- rel->rd_smgr->smgr_vm_nblocks == InvalidBlockNumber) &&
- !smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
- smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, false);
-
- vm_nblocks_now = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
-
- /* Now extend the file */
- while (vm_nblocks_now < vm_nblocks)
- {
- PageSetChecksumInplace(pg, vm_nblocks_now);
-
- smgrextend(rel->rd_smgr, VISIBILITYMAP_FORKNUM, vm_nblocks_now,
- (char *) pg, false);
- vm_nblocks_now++;
- }
-
- /*
- * Send a shared-inval message to force other backends to close any smgr
- * references they may have for this rel, which we are about to change.
- * This is a useful optimization because it means that backends don't have
- * to keep checking for creation or extension of the file, which happens
- * infrequently.
- */
- CacheInvalidateSmgr(rel->rd_smgr->smgr_rnode);
-
- /* Update local cache with the up-to-date size */
- rel->rd_smgr->smgr_vm_nblocks = vm_nblocks_now;
-
- UnlockRelationForExtension(rel, ExclusiveLock);
-
- pfree(pg);
-}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..2c30126 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -27,7 +27,7 @@
#include "access/relscan.h"
#include "access/sysattr.h"
#include "access/transam.h"
-#include "access/visibilitymap.h"
+#include "access/pageinfomap.h"
#include "access/xact.h"
#include "bootstrap/bootstrap.h"
#include "catalog/binary_upgrade.h"
@@ -1813,8 +1813,8 @@ FormIndexDatum(IndexInfo *indexInfo,
* isprimary: if true, set relhaspkey true; else no change
* reltuples: if >= 0, set reltuples to this value; else no change
*
- * If reltuples >= 0, relpages and relallvisible are also updated (using
- * RelationGetNumberOfBlocks() and visibilitymap_count()).
+ * If reltuples >= 0, relpages, relallvisible are also updated (using
+ * RelationGetNumberOfBlocks() and pageinfomap_count()).
*
* NOTE: an important side-effect of this operation is that an SI invalidation
* message is sent out to all backends --- including me --- causing relcache
@@ -1921,7 +1921,7 @@ index_update_stats(Relation rel,
BlockNumber relallvisible;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ relallvisible = pageinfomap_count(rel, NULL);
else /* don't bother for indexes */
relallvisible = 0;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index d4440c9..eaf0796 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,7 +19,7 @@
#include "postgres.h"
-#include "access/visibilitymap.h"
+#include "access/pageinfomap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -237,17 +237,17 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
rel->rd_smgr->smgr_targblock = InvalidBlockNumber;
rel->rd_smgr->smgr_fsm_nblocks = InvalidBlockNumber;
- rel->rd_smgr->smgr_vm_nblocks = InvalidBlockNumber;
+ rel->rd_smgr->smgr_pim_nblocks = InvalidBlockNumber;
/* Truncate the FSM first if it exists */
fsm = smgrexists(rel->rd_smgr, FSM_FORKNUM);
if (fsm)
FreeSpaceMapTruncateRel(rel, nblocks);
- /* Truncate the visibility map too if it exists. */
- vm = smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
+ /* Truncate the page info map too if it exists. */
+ vm = smgrexists(rel->rd_smgr, PAGEINFOMAP_FORKNUM);
if (vm)
- visibilitymap_truncate(rel, nblocks);
+ pageinfomap_truncate(rel, nblocks);
/*
* We WAL-log the truncation before actually truncating, which means
@@ -278,8 +278,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
/*
* Flush, because otherwise the truncation of the main relation might
* hit the disk before the WAL record, and the truncation of the FSM
- * or visibility map. If we crashed during that window, we'd be left
- * with a truncated heap, but the FSM or visibility map would still
+ * or page info map. If we crashed during that window, we'd be left
+ * with a truncated heap, but the FSM or page info map would still
* contain entries for the non-existent heap pages.
*/
if (fsm || vm)
@@ -527,13 +527,13 @@ smgr_redo(XLogReaderState *record)
/* Also tell xlogutils.c about it */
XLogTruncateRelation(xlrec->rnode, MAIN_FORKNUM, xlrec->blkno);
- /* Truncate FSM and VM too */
+ /* Truncate FSM and PIM too */
rel = CreateFakeRelcacheEntry(xlrec->rnode);
if (smgrexists(reln, FSM_FORKNUM))
FreeSpaceMapTruncateRel(rel, xlrec->blkno);
- if (smgrexists(reln, VISIBILITYMAP_FORKNUM))
- visibilitymap_truncate(rel, xlrec->blkno);
+ if (smgrexists(reln, PAGEINFOMAP_FORKNUM))
+ pageinfomap_truncate(rel, xlrec->blkno);
FreeFakeRelcacheEntry(rel);
}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ccc030f..8c555eb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -444,6 +444,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(C.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ddb68ab..a341297 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -20,7 +20,7 @@
#include "access/transam.h"
#include "access/tupconvert.h"
#include "access/tuptoaster.h"
-#include "access/visibilitymap.h"
+#include "access/pageinfomap.h"
#include "access/xact.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
@@ -314,6 +314,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ BlockNumber relallvisible,
+ relallfrozen;
if (inh)
ereport(elevel,
@@ -564,6 +566,10 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
}
}
+ /* Count the number of all-visible and all-frozen pages */
+ if (!inh)
+ relallvisible = pageinfomap_count(onerel, &relallfrozen);
+
/*
* Update pages/tuples stats in pg_class ... but not if we're doing
* inherited stats.
@@ -572,7 +578,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ relallvisible,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -608,7 +614,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
* tracks per-table stats.
*/
if (!inh)
- pgstat_report_analyze(onerel, totalrows, totaldeadrows);
+ pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
if (!(options & VACOPT_VACUUM))
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7c4ef58..0a02a25 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -729,11 +729,11 @@ vac_estimate_reltuples(Relation relation, bool is_analyze,
* marked with xmin = our xid.
*
* In addition to fundamentally nontransactional statistics such as
- * relpages and relallvisible, we try to maintain certain lazily-updated
- * DDL flags such as relhasindex, by clearing them if no longer correct.
- * It's safe to do this in VACUUM, which can't run in parallel with
- * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
- * However, it's *not* safe to do it in an ANALYZE that's within an
+ * relpages, relallvisible, we try to maintain certain lazily-updated
+ * DDL flags such as relhasindex, by clearing them if no longer correct.
+ * It's safe to do this in VACUUM, which can't run in
+ * parallel with CREATE INDEX/RULE/TRIGGER and can't be part of a transaction
+ * block. However, it's *not* safe to do it in an ANALYZE that's within an
* outer transaction, because for example the current transaction might
* have dropped the last index; then we'd think relhasindex should be
* cleared, but if the transaction later rolls back this would be wrong.
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2429889..2a42928 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -43,7 +43,7 @@
#include "access/htup_details.h"
#include "access/multixact.h"
#include "access/transam.h"
-#include "access/visibilitymap.h"
+#include "access/pageinfomap.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
@@ -93,7 +93,7 @@
/*
* Before we consider skipping a page that's marked as clean in
- * visibility map, we must've seen at least this many clean pages.
+ * page info map, we must've seen at least this many clean pages.
*/
#define SKIP_PAGES_THRESHOLD ((BlockNumber) 32)
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber pimskipped_frozen_pages; /* # of pages skipped due to the
+ * all-frozen bit of the page info map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -146,7 +148,7 @@ static void lazy_cleanup_index(Relation indrel,
IndexBulkDeleteResult *stats,
LVRelStats *vacrelstats);
static int lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
- int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
+ int tupindex, LVRelStats *vacrelstats, Buffer *pimbuffer);
static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
static BlockNumber count_nondeletable_pages(Relation onerel,
LVRelStats *vacrelstats);
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -221,7 +224,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* We request a full scan if either the table's frozen Xid is now older
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit.
+ * mxid full-table scan limit. During a full scan, we can still skip some
+ * pages according to the all-frozen bit of the page info map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +257,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
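+ /*
+ * Pages skipped via the all-frozen bit contain only frozen tuples, so
+ * skipping them doesn't prevent advancing relfrozenxid; count them as
+ * scanned for this check.
+ */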
+ if ((vacrelstats->scanned_pages + vacrelstats->pimskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -301,10 +306,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = pageinfomap_count(onerel, &new_rel_allfrozen);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -325,7 +333,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),
onerel->rd_rel->relisshared,
new_live_tuples,
- vacrelstats->new_dead_tuples);
+ vacrelstats->new_dead_tuples,
+ new_rel_allfrozen);
/* and log the action if appropriate */
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
@@ -360,10 +369,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to page info map\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->pimskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -451,7 +461,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
IndexBulkDeleteResult **indstats;
int i;
PGRUsage ru0;
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
BlockNumber next_not_all_visible_block;
bool skipping_all_visible_blocks;
xl_heap_freeze_tuple *frozen;
@@ -482,40 +492,43 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/*
* We want to skip pages that don't require vacuuming according to the
- * visibility map, but only when we can skip at least SKIP_PAGES_THRESHOLD
+ * page info map, but only when we can skip at least SKIP_PAGES_THRESHOLD
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * the page info map means that we can't update relfrozenxid, so we only
+ * want to do it if we can skip a goodly number. On the other hand, we
+ * count both the pages we skipped according to the all-frozen bit of the
+ * page info map and the pages we scanned and froze, so we can update
+ * relfrozenxid if the sum of the two covers every page in the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
- * all-visible according to the visibility map, or nblocks if there's no
+ * all-visible according to the page info map, or nblocks if there's no
* such block. Also, we set up the skipping_all_visible_blocks flag,
* which is needed because we need hysteresis in the decision: once we've
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we skip only all-frozen pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
- * all_visible_according_to_vm flag correctly for each page.
+ * all_visible_according_to_pim flag correctly for each page.
*
- * Note: The value returned by visibilitymap_test could be slightly
+ * Note: The value returned by pageinfomap_get_status could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*/
for (next_not_all_visible_block = 0;
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!PIM_ALL_VISIBLE(onerel, next_not_all_visible_block, &pimbuffer))
break;
vacuum_delay_point();
}
@@ -533,9 +546,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples we froze on this page */
+ int nalready_frozen; /* # of tuples that were already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page;
Size freespace;
- bool all_visible_according_to_vm;
+ bool all_visible_according_to_pim;
+ bool all_frozen_according_to_pim;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -547,8 +564,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ if (!PIM_ALL_VISIBLE(onerel, next_not_all_visible_block, &pimbuffer))
break;
vacuum_delay_point();
}
@@ -562,14 +578,29 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
skipping_all_visible_blocks = true;
else
skipping_all_visible_blocks = false;
- all_visible_according_to_vm = false;
+
+ all_visible_according_to_pim = false;
+ all_frozen_according_to_pim = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
- all_visible_according_to_vm = true;
+ /*
+ * This block is at least all-visible according to the page info map.
+ * We check whether this block is all-frozen or not, so that we can
+ * skip vacuuming this page even if scan_all is true.
+ */
+ bool all_frozen = PIM_ALL_FROZEN(onerel, blkno, &pimbuffer);
+
+ if (scan_all && all_frozen)
+ {
+ vacrelstats->pimskipped_frozen_pages++;
+ continue;
+ }
+ else if (!scan_all && skipping_all_visible_blocks)
+ continue;
+
+ all_visible_according_to_pim = true;
+ all_frozen_according_to_pim = all_frozen;
}
vacuum_delay_point();
@@ -583,14 +614,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
{
/*
* Before beginning index vacuuming, we release any pin we may
- * hold on the visibility map page. This isn't necessary for
+ * hold on the page info map page. This isn't necessary for
* correctness, but we do it anyway to avoid holding the pin
* across a lengthy, unrelated operation.
*/
- if (BufferIsValid(vmbuffer))
+ if (BufferIsValid(pimbuffer))
{
- ReleaseBuffer(vmbuffer);
- vmbuffer = InvalidBuffer;
+ ReleaseBuffer(pimbuffer);
+ pimbuffer = InvalidBuffer;
}
/* Log cleanup info before we touch indexes */
@@ -614,14 +645,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
}
/*
- * Pin the visibility map page in case we need to mark the page
+ * Pin the page info map page in case we need to mark the page
* all-visible. In most cases this will be very cheap, because we'll
* already have the correct page pinned anyway. However, it's
* possible that (a) next_not_all_visible_block is covered by a
- * different VM page than the current block or (b) we released our pin
+ * different PIM page than the current block or (b) we released our pin
* and did a cycle of index vacuuming.
*/
- visibilitymap_pin(onerel, blkno, &vmbuffer);
+ pageinfomap_pin(onerel, blkno, &pimbuffer);
buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vac_strategy);
@@ -716,7 +747,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
empty_pages++;
freespace = PageGetHeapFreeSpace(page);
- /* empty pages are always all-visible */
+ /* empty pages are always all-visible and all-frozen */
if (!PageIsAllVisible(page))
{
START_CRIT_SECTION();
@@ -739,8 +770,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
log_newpage_buffer(buf, true);
PageSetAllVisible(page);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ PageSetAllFrozen(page);
+ pageinfomap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ pimbuffer, InvalidTransactionId,
+ PAGEINFOMAP_ALL_VISIBLE | PAGEINFOMAP_ALL_FROZEN);
END_CRIT_SECTION();
}
@@ -764,6 +797,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +953,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +971,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we freeze any tuples, mark the buffer dirty, and write a WAL
+ * record recording the changes. We must log the changes to be crash-safe
+ * against future truncation of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1006,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the total number of frozen tuples on this page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -974,7 +1017,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
vacrelstats->num_dead_tuples > 0)
{
/* Remove tuples from heap */
- lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
+ lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &pimbuffer);
has_dead_tuples = false;
/*
@@ -988,63 +1031,94 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_pim)
+ {
+ /*
+ * It should never be the case that the page info map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the PIM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to pageinfomap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= PAGEINFOMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_pim)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= PAGEINFOMAP_ALL_FROZEN;
+ }
+
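+ /* Dirty the buffer and update the page info map only if we set new bits */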
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ pageinfomap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ pimbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
- * As of PostgreSQL 9.2, the visibility map bit should never be set if
+ * As of PostgreSQL 9.2, the page info map bit should never be set if
* the page-level bit is clear. However, it's possible that the bit
* got cleared after we checked it and before we took the buffer
* content lock, so we must recheck before jumping to the conclusion
* that something bad has happened.
*/
- else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ else if (all_visible_according_to_pim && !PageIsAllVisible(page)
+ && PIM_ALL_VISIBLE(onerel, blkno, &pimbuffer))
{
- elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+ /* If the all-frozen bit is set then the all-visible bit must be set too */
+ if (all_frozen_according_to_pim)
+ Assert(PIM_ALL_FROZEN(onerel, blkno, &pimbuffer) &&
+ PIM_ALL_VISIBLE(onerel, blkno, &pimbuffer));
+
+ elog(WARNING, "page is not marked all-visible (and all-frozen) but page info map bit(s) is set in relation \"%s\" page %u",
relname, blkno);
- visibilitymap_clear(onerel, blkno, vmbuffer);
+ pageinfomap_clear(onerel, blkno, pimbuffer);
}
/*
* It's possible for the value returned by GetOldestXmin() to move
* backwards, so it's not wrong for us to see tuples that appear to
- * not be visible to everyone yet, while PD_ALL_VISIBLE is already
- * set. The real safe xmin value never moves backwards, but
- * GetOldestXmin() is conservative and sometimes returns a value
+ * not be visible to everyone yet, while PD_ALL_VISIBLE (and PD_ALL_FROZEN)
+ * are already set. The real safe xmin value never moves backwards,
+ * but GetOldestXmin() is conservative and sometimes returns a value
* that's unnecessarily small, so if we see that contradiction it just
* means that the tuples that we think are not visible to everyone yet
- * actually are, and the PD_ALL_VISIBLE flag is correct.
+ * actually are, and the PD_ALL_VISIBLE (and PD_ALL_FROZEN) flags are
+ * correct.
*
* There should never be dead tuples on a page with PD_ALL_VISIBLE
* set, however.
*/
else if (PageIsAllVisible(page) && has_dead_tuples)
{
- elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+ /* If the all-frozen bit is set then the all-visible bit must be set too */
+ if (PageIsAllFrozen(page))
+ Assert(PIM_ALL_FROZEN(onerel, blkno, &pimbuffer) &&
+ PIM_ALL_VISIBLE(onerel, blkno, &pimbuffer));
+
+ elog(WARNING, "page containing dead tuples is marked as all-visible (and all-frozen) in relation \"%s\" page %u",
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
- visibilitymap_clear(onerel, blkno, vmbuffer);
+ pageinfomap_clear(onerel, blkno, pimbuffer);
}
UnlockReleaseBuffer(buf);
@@ -1078,12 +1152,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
num_tuples);
/*
- * Release any remaining pin on visibility map page.
+ * Release any remaining pin on page info map page.
*/
- if (BufferIsValid(vmbuffer))
+ if (BufferIsValid(pimbuffer))
{
- ReleaseBuffer(vmbuffer);
- vmbuffer = InvalidBuffer;
+ ReleaseBuffer(pimbuffer);
+ pimbuffer = InvalidBuffer;
}
/* If any tuples need to be deleted, perform final vacuum cycle */
@@ -1114,6 +1188,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to page info map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to page info map",
+ "skipped %d frozen pages according to page info map",
+ vacrelstats->pimskipped_frozen_pages,
+ vacrelstats->pimskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1162,7 +1243,7 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
int tupindex;
int npages;
PGRUsage ru0;
- Buffer vmbuffer = InvalidBuffer;
+ Buffer pimbuffer = InvalidBuffer;
pg_rusage_init(&ru0);
npages = 0;
@@ -1187,7 +1268,7 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
continue;
}
tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
- &vmbuffer);
+ &pimbuffer);
/* Now that we've compacted the page, record its available space */
page = BufferGetPage(buf);
@@ -1198,10 +1279,10 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
npages++;
}
- if (BufferIsValid(vmbuffer))
+ if (BufferIsValid(pimbuffer))
{
- ReleaseBuffer(vmbuffer);
- vmbuffer = InvalidBuffer;
+ ReleaseBuffer(pimbuffer);
+ pimbuffer = InvalidBuffer;
}
ereport(elevel,
@@ -1224,12 +1305,13 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
*/
static int
lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
- int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer)
+ int tupindex, LVRelStats *vacrelstats, Buffer *pimbuffer)
{
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1270,7 +1352,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
/*
* End critical section, so we safely can do visibility tests (which
* possibly need to perform IO and allocate memory!). If we crash now the
- * page (including the corresponding vm bit) might not be marked all
+ * page (including the corresponding pim bit) might not be marked all
* visible, but that's fine. A later vacuum will fix that.
*/
END_CRIT_SECTION();
@@ -1281,19 +1363,34 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the PIM all-visible bit.
+ * Also, if this page is all-frozen, set the PIM all-frozen bit and flag.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
- Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+ uint8 pim_status = pageinfomap_get_status(onerel, blkno, pimbuffer);
+ uint8 flags = 0;
+
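+ /* Request the PIM all-visible bit if it isn't already set */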
+ if (!(pim_status & PAGEINFOMAP_ALL_VISIBLE))
+ flags |= PAGEINFOMAP_ALL_VISIBLE;
+
+ /* Add the PIM all-frozen bit to the flags, if needed */
+ if (all_frozen && !(pim_status & PAGEINFOMAP_ALL_FROZEN))
+ {
+ PageSetAllFrozen(page);
+ flags |= PAGEINFOMAP_ALL_FROZEN;
+ }
+
+ Assert(BufferIsValid(*pimbuffer));
+
+ if (pim_status != flags)
+ pageinfomap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *pimbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1783,10 +1880,12 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and all_frozen which indicates whether
+ * all tuples on this page are frozen.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1795,6 +1894,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1818,11 +1918,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1859,6 +1960,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1867,6 +1972,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
@@ -1875,5 +1981,8 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
} /* scan along page */
+ if (!all_visible)
+ *all_frozen = false;
+
return all_visible;
}
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..f4cd9c6 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -25,7 +25,7 @@
#include "postgres.h"
#include "access/relscan.h"
-#include "access/visibilitymap.h"
+#include "access/pageinfomap.h"
#include "executor/execdebug.h"
#include "executor/nodeIndexonlyscan.h"
#include "executor/nodeIndexscan.h"
@@ -85,38 +85,37 @@ IndexOnlyNext(IndexOnlyScanState *node)
* which all tuples are known visible to everybody. In any case,
* we'll use the index tuple not the heap tuple as the data source.
*
- * Note on Memory Ordering Effects: visibilitymap_test does not lock
- * the visibility map buffer, and therefore the result we read here
+ * Note on Memory Ordering Effects: pageinfomap_get_status does not lock
+ * the page info map buffer, and therefore the result we read here
* could be slightly stale. However, it can't be stale enough to
* matter.
*
- * We need to detect clearing a VM bit due to an insert right away,
+ * We need to detect clearing a PIM bit due to an insert right away,
* because the tuple is present in the index page but not visible. The
* reading of the TID by this scan (using a shared lock on the index
* buffer) is serialized with the insert of the TID into the index
- * (using an exclusive lock on the index buffer). Because the VM bit
+ * (using an exclusive lock on the index buffer). Because the PIM bit
* is cleared before updating the index, and locking/unlocking of the
* index page acts as a full memory barrier, we are sure to see the
* cleared bit if we see a recently-inserted TID.
*
* Deletes do not update the index page (only VACUUM will clear out
- * the TID), so the clearing of the VM bit by a delete is not
+ * the TID), so the clearing of the PIM bit by a delete is not
* serialized with this test below, and we may see a value that is
* significantly stale. However, we don't care about the delete right
* away, because the tuple is still visible until the deleting
* transaction commits or the statement ends (if it's our
- * transaction). In either case, the lock on the VM buffer will have
+ * transaction). In either case, the lock on the PIM buffer will have
* been released (acting as a write barrier) after clearing the bit.
* And for us to have a snapshot that includes the deleting
* transaction (making the tuple invisible), we must have acquired
* ProcArrayLock after that time, acting as a read barrier.
*
* It's worth going through this complexity to avoid needing to lock
- * the VM buffer, which could cause significant contention.
+ * the PIM buffer, which could cause significant contention.
*/
- if (!visibilitymap_test(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ if (!PIM_ALL_VISIBLE(scandesc->heapRelation, ItemPointerGetBlockNumber(tid),
+ &node->ioss_PIMBuffer))
{
/*
* Rats, we have to visit the heap to check visibility.
@@ -322,11 +321,11 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
indexScanDesc = node->ioss_ScanDesc;
relation = node->ss.ss_currentRelation;
- /* Release VM buffer pin, if any. */
- if (node->ioss_VMBuffer != InvalidBuffer)
+ /* Release PIM buffer pin, if any. */
+ if (node->ioss_PIMBuffer != InvalidBuffer)
{
- ReleaseBuffer(node->ioss_VMBuffer);
- node->ioss_VMBuffer = InvalidBuffer;
+ ReleaseBuffer(node->ioss_PIMBuffer);
+ node->ioss_PIMBuffer = InvalidBuffer;
}
/*
@@ -546,7 +545,7 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
/* Set it up for index-only scan */
indexstate->ioss_ScanDesc->xs_want_itup = true;
- indexstate->ioss_VMBuffer = InvalidBuffer;
+ indexstate->ioss_PIMBuffer = InvalidBuffer;
/*
* If no run-time keys to calculate, go ahead and pass the scankeys to the
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 990486c..d27a35b 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -468,7 +468,7 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
* estimates based on the correlation squared (XXX is that appropriate?).
*
* If it's an index-only scan, then we will not need to fetch any heap
- * pages for which the visibility map shows all tuples are visible.
+ * pages for which the page info map shows all tuples are visible.
* Hence, reduce the estimated number of heap fetches accordingly.
* We use the measured fraction of the entire heap that is all-visible,
* which might not be particularly relevant to the subset of the heap
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 9442e5f..7a1565a 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -780,7 +780,7 @@ infer_collation_opclass_match(InferenceElem *elem, Relation idxRel,
* estimate_rel_size - estimate # pages and # tuples in a table or index
*
* We also estimate the fraction of the pages that are marked all-visible in
- * the visibility map, for use in estimation of index-only scans.
+ * the page info map, for use in estimation of index-only scans.
*
* If attr_widths isn't NULL, it points to the zero-index entry of the
* relation's attr_widths[] cache; we fill this in if we have need to compute
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ab018c4..ca7257a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1329,7 +1329,8 @@ pgstat_report_autovac(Oid dboid)
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgVacuum msg;
@@ -1343,6 +1344,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -1354,7 +1356,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
*/
void
pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgAnalyze msg;
@@ -1394,6 +1397,7 @@ pgstat_report_analyze(Relation rel,
msg.m_analyzetime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -3702,6 +3706,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
+ result->n_frozen_pages = 0;
result->blocks_fetched = 0;
result->blocks_hit = 0;
result->vacuum_timestamp = 0;
@@ -5069,6 +5074,7 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
if (msg->m_autovacuum)
{
@@ -5103,6 +5109,7 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
/*
* We reset changes_since_analyze to zero, forgetting any changes that
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..9e5bd46 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -167,7 +167,7 @@ smgropen(RelFileNode rnode, BackendId backend)
reln->smgr_owner = NULL;
reln->smgr_targblock = InvalidBlockNumber;
reln->smgr_fsm_nblocks = InvalidBlockNumber;
- reln->smgr_vm_nblocks = InvalidBlockNumber;
+ reln->smgr_pim_nblocks = InvalidBlockNumber;
reln->smgr_which = 0; /* we only have md.c at present */
/* mark it not open */
diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
index 5ee59d0..c2ac902 100644
--- a/src/backend/utils/adt/dbsize.c
+++ b/src/backend/utils/adt/dbsize.c
@@ -348,12 +348,12 @@ calculate_toast_table_size(Oid toastrelid)
toastRel = relation_open(toastrelid, AccessShareLock);
- /* toast heap size, including FSM and VM size */
+ /* toast heap size, including FSM and PIM size */
for (forkNum = 0; forkNum <= MAX_FORKNUM; forkNum++)
size += calculate_relation_size(&(toastRel->rd_node),
toastRel->rd_backend, forkNum);
- /* toast index size, including FSM and VM size */
+ /* toast index size, including FSM and PIM size */
indexlist = RelationGetIndexList(toastRel);
/* Size is calculated using all the indexes available */
@@ -377,7 +377,7 @@ calculate_toast_table_size(Oid toastrelid)
/*
* Calculate total on-disk size of a given table,
- * including FSM and VM, plus TOAST table if any.
+ * including FSM and PIM, plus TOAST table if any.
* Indexes other than the TOAST table's index are not included.
*
* Note that this also behaves sanely if applied to an index or toast table;
@@ -390,7 +390,7 @@ calculate_table_size(Relation rel)
ForkNumber forkNum;
/*
- * heap size, including FSM and VM
+ * heap size, including FSM and PIM
*/
for (forkNum = 0; forkNum <= MAX_FORKNUM; forkNum++)
size += calculate_relation_size(&(rel->rd_node), rel->rd_backend,
@@ -485,7 +485,7 @@ pg_indexes_size(PG_FUNCTION_ARGS)
/*
* Compute the on-disk size of all files for the relation,
- * including heap data, index data, toast data, FSM, VM.
+ * including heap data, index data, toast data, FSM, PIM.
*/
static int64
calculate_total_relation_size(Relation rel)
@@ -494,7 +494,7 @@ calculate_total_relation_size(Relation rel)
/*
* Aggregate the table size, this includes size of the heap, toast and
- * toast index with free space and visibility map
+ * toast index with free space and page info map
*/
size = calculate_table_size(rel);
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f7c9bf6..98c14f9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -46,6 +46,7 @@ extern Datum pg_stat_get_vacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autovacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_analyze_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_frozen_pages(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_calls(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_total_time(PG_FUNCTION_ARGS);
@@ -450,6 +451,21 @@ pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_frozen_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int32 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int32) (tabentry->n_frozen_pages);
+
+ PG_RETURN_INT32(result);
+}
+
+Datum
pg_stat_get_function_calls(PG_FUNCTION_ARGS)
{
Oid funcid = PG_GETARG_OID(0);
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 41d4606..2b06013 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -231,6 +231,15 @@ check_cluster_versions(void)
if (old_cluster.major_version > new_cluster.major_version)
pg_fatal("This utility cannot be used to downgrade to older major PostgreSQL versions.\n");
+ /*
+ * We can't allow upgrading in link mode from 9.5 or before to 9.6 or later,
+ * because the visibility map has been changed to the page info map in version 9.6.
+ */
+ if (user_opts.transfer_mode == TRANSFER_MODE_LINK &&
+ GET_MAJOR_VERSION(old_cluster.major_version) < 906 &&
+ GET_MAJOR_VERSION(new_cluster.major_version) >= 906)
+ pg_fatal("This utility cannot upgrade from PostgreSQL version from 9.5 or before to 9.6 or later with link mode.\n");
+
/* get old and new binary versions */
get_bin_version(&old_cluster);
get_bin_version(&new_cluster);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 79d9390..109b677 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
#include <fcntl.h>
@@ -21,6 +22,43 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static const char *rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool force);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
+
+/*
+ * copyOrRewriteFile()
+ * This function copies a file, or rewrites a visibility map file into a page info map file.
+ * If rewrite_vm is true, we have to rewrite the visibility map regardless of the value of pageConverter.
+ */
+const char *
+copyOrRewriteFile(pageCnvCtx *pageConverter,
+ const char *src, const char *dst, bool force, bool rewrite_vm)
+{
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, force);
+ else
+ return copyAndUpdateFile(pageConverter, src, dst, force);
+}
/*
* copyAndUpdateFile()
@@ -201,6 +239,96 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilitymap()
+ *
+ * Since an additional bit indicating that all tuples on a page are completely
+ * frozen has been added to the visibility map, the visibility map becomes the page info map.
+ * Rewrite a visibility map file, adding a cleared all-frozen bit (0) alongside each existing bit.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+#define REWRITE_BUF_SIZE (50 * BLCKSZ)
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd = 0;
+ int dst_fd = 0;
+ uint16 vm_bits;
+ ssize_t nbytes;
+ char *buffer = NULL;
+ int ret = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return getErrorText(EINVAL);
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
+
+ /* Copy page header data in advance */
+ if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
+ goto err;
+
+ if (write(dst_fd, buffer, nbytes) != nbytes)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ /* perform data rewriting, i.e., read from source, write to destination */
+ while (true)
+ {
+ ssize_t nbytes = read(src_fd, buffer, REWRITE_BUF_SIZE);
+ char *cur, *end;
+
+ if (nbytes < 0)
+ {
+ ret = -1;
+ break;
+ }
+
+ if (nbytes == 0)
+ break;
+
+ cur = buffer;
+ end = buffer + nbytes;
+
+ /* Rewrite each source byte and write BITS_PER_HEAPBLOCK bytes to dst_fd */
+ while (end > cur)
+ {
+ /* Get the rewritten bits from the lookup table */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+
+ if (write(dst_fd, &vm_bits, BITS_PER_HEAPBLOCK) != BITS_PER_HEAPBLOCK)
+ {
+ ret = -1;
+ break;
+ }
+ cur++;
+ }
+ }
+
+err:
+
+ if (buffer)
+ pg_free(buffer);
+
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText(errno);
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 13aa891..95c6df1 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -112,6 +112,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The visibility map changed to the page info map with this 9.6 commit.
+ */
+#define VISIBILITY_MAP_CHANGE_TO_PAGEINFOMAP_CAT_VER 201511131
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -396,6 +400,8 @@ const pageCnvCtx *setupPageConverter(void);
typedef void *pageCnvCtx;
#endif
+const char *copyOrRewriteFile(pageCnvCtx *pageConverter, const char *src,
+ const char *dst, bool force, bool rewrite_vm);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c22df42..41d80ef 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *old_type_suffix, const char *new_type_suffix);
/*
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_need_rewrite = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,13 +181,20 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap to pageinfomap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_CHANGE_TO_PAGEINFOMAP_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CHANGE_TO_PAGEINFOMAP_CAT_VER)
+ vm_need_rewrite = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(pageConverter, &maps[mapnum], "", "");
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +202,14 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(pageConverter, &maps[mapnum], "_fsm", "_fsm");
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ {
+ if (vm_need_rewrite)
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", "_pim");
+ else
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", "_vm");
+ }
}
}
}
@@ -210,7 +223,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
*/
static void
transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+ const char *old_type_suffix, const char *new_type_suffix)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -218,6 +231,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
int fd;
int segno;
char extent_suffix[65];
+ bool rewrite_vm = false;
/*
* Now copy/link any related segments as well. Remember, PG breaks large
@@ -236,18 +250,18 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
map->old_tablespace_suffix,
map->old_db_oid,
map->old_relfilenode,
- type_suffix,
+ old_type_suffix,
extent_suffix);
snprintf(new_file, sizeof(new_file), "%s%s/%u/%u%s%s",
map->new_tablespace,
map->new_tablespace_suffix,
map->new_db_oid,
map->new_relfilenode,
- type_suffix,
+ new_type_suffix,
extent_suffix);
/* Is it an extent, fsm, or vm file? */
- if (type_suffix[0] != '\0' || segno != 0)
+ if (old_type_suffix[0] != '\0' || segno != 0)
{
/* Did file open fail? */
if ((fd = open(old_file, O_RDONLY, 0)) == -1)
@@ -276,7 +290,11 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ /* Is it a vm file that needs to be rewritten? */
+ if (strcmp(old_type_suffix, "_vm") == 0 && strcmp(old_type_suffix, new_type_suffix) != 0)
+ rewrite_vm = true;
+
+ if ((msg = copyOrRewriteFile(pageConverter, old_file, new_file, true, rewrite_vm)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index aa7f399..f5d80cb 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for page info map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for page info map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 66dfef1..bac1157 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -34,7 +34,7 @@
const char *const forkNames[] = {
"main", /* MAIN_FORKNUM */
"fsm", /* FSM_FORKNUM */
- "vm", /* VISIBILITYMAP_FORKNUM */
+ "pim", /* PAGEINFOMAP_FORKNUM */
"init" /* INIT_FORKNUM */
};
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..dd8a4cc 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -312,17 +312,18 @@ typedef struct xl_heap_freeze_page
#define SizeOfHeapFreezePage (offsetof(xl_heap_freeze_page, ntuples) + sizeof(uint16))
/*
- * This is what we need to know about setting a visibility map bit
+ * This is what we need to know about setting a page info map bit
*
- * Backup blk 0: visibility map buffer
+ * Backup blk 0: page info map buffer
* Backup blk 1: heap buffer
*/
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer pim_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index b3b91e7..a200e5e 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -40,6 +40,6 @@ extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other);
+ Buffer *pimbuffer, Buffer *pimbuffer_other);
#endif /* HIO_H */
diff --git a/src/include/access/pageinfomap.h b/src/include/access/pageinfomap.h
new file mode 100644
index 0000000..da217d2
--- /dev/null
+++ b/src/include/access/pageinfomap.h
@@ -0,0 +1,46 @@
+/*-------------------------------------------------------------------------
+ *
+ * pageinfomap.h
+ * page info map interface
+ *
+ *
+ * Portions Copyright (c) 2007-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/pageinfomap.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PAGEINFOMAP_H
+#define PAGEINFOMAP_H
+
+#include "access/xlogdefs.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "utils/relcache.h"
+
+/* Flags for bit map */
+#define PAGEINFOMAP_ALL_VISIBLE 0x01
+#define PAGEINFOMAP_ALL_FROZEN 0x02
+
+#define PAGEINFOMAP_ALL_FLAGS 0x03
+
+/* Macros for pageinfomap test */
+#define PIM_ALL_VISIBLE(r, b, v) \
+ ((pageinfomap_get_status((r), (b), (v)) & PAGEINFOMAP_ALL_VISIBLE) != 0)
+#define PIM_ALL_FROZEN(r, b, v) \
+ ((pageinfomap_get_status((r), (b), (v)) & PAGEINFOMAP_ALL_FROZEN) != 0)
+
+extern void pageinfomap_clear(Relation rel, BlockNumber heapBlk,
+ Buffer vmbuf);
+extern void pageinfomap_pin(Relation rel, BlockNumber heapBlk,
+ Buffer *vmbuf);
+extern bool pageinfomap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
+extern void pageinfomap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern uint8 pageinfomap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern BlockNumber pageinfomap_count(Relation rel, BlockNumber *all_frozen);
+extern void pageinfomap_truncate(Relation rel, BlockNumber nheapblocks);
+
+#endif /* PAGEINFOMAP_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
deleted file mode 100644
index 0c0e0ef..0000000
--- a/src/include/access/visibilitymap.h
+++ /dev/null
@@ -1,33 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * visibilitymap.h
- * visibility map interface
- *
- *
- * Portions Copyright (c) 2007-2015, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
- * src/include/access/visibilitymap.h
- *
- *-------------------------------------------------------------------------
- */
-#ifndef VISIBILITYMAP_H
-#define VISIBILITYMAP_H
-
-#include "access/xlogdefs.h"
-#include "storage/block.h"
-#include "storage/buf.h"
-#include "utils/relcache.h"
-
-extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
- Buffer vmbuf);
-extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
- Buffer *vmbuf);
-extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
-extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
-extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
-
-#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index eba4150..3ff384b 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201511071
+#define CATALOG_VERSION_NO 201511131
#endif
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index d8640db..e3d9530 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2779,6 +2779,8 @@ DATA(insert OID = 3056 ( pg_stat_get_analyze_count PGNSP PGUID 12 1 0 0 0 f f f
DESCR("statistics: number of manual analyzes for a table");
DATA(insert OID = 3057 ( pg_stat_get_autoanalyze_count PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_autoanalyze_count _null_ _null_ _null_ ));
DESCR("statistics: number of auto analyzes for a table");
+DATA(insert OID = 6015 ( pg_stat_get_frozen_pages PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_frozen_pages _null_ _null_ _null_ ));
+DESCR("statistics: number of frozen pages of table");
DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 0 f f f f t t s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_idset _null_ _null_ _null_ ));
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f f f t s r 1 0 2249 "23" "{23,26,23,26,25,25,25,16,1184,1184,1184,1184,869,25,23,28,28,16,25,25,23,16,25}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,pid,usesysid,application_name,state,query,waiting,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,ssl,sslversion,sslcipher,sslbits,sslcompression,sslclientdn}" _null_ _null_ pg_stat_get_activity _null_ _null_ _null_ ));
@@ -3665,7 +3667,7 @@ DESCR("convert a long int to a human readable text using size units");
DATA(insert OID = 3166 ( pg_size_pretty PGNSP PGUID 12 1 0 0 0 f f f f t f v s 1 0 25 "1700" _null_ _null_ _null_ _null_ _null_ pg_size_pretty_numeric _null_ _null_ _null_ ));
DESCR("convert a numeric to a human readable text using size units");
DATA(insert OID = 2997 ( pg_table_size PGNSP PGUID 12 1 0 0 0 f f f f t f v s 1 0 20 "2205" _null_ _null_ _null_ _null_ _null_ pg_table_size _null_ _null_ _null_ ));
-DESCR("disk space usage for the specified table, including TOAST, free space and visibility map");
+DESCR("disk space usage for the specified table, including TOAST, free space and page info map");
DATA(insert OID = 2998 ( pg_indexes_size PGNSP PGUID 12 1 0 0 0 f f f f t f v s 1 0 20 "2205" _null_ _null_ _null_ _null_ _null_ pg_indexes_size _null_ _null_ _null_ ));
DESCR("disk space usage for all indexes attached to the specified table");
DATA(insert OID = 2999 ( pg_relation_filenode PGNSP PGUID 12 1 0 0 0 f f f f t f s s 1 0 26 "2205" _null_ _null_ _null_ _null_ _null_ pg_relation_filenode _null_ _null_ _null_ ));
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a263779..90ee722 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -26,7 +26,7 @@ typedef enum ForkNumber
InvalidForkNumber = -1,
MAIN_FORKNUM = 0,
FSM_FORKNUM,
- VISIBILITYMAP_FORKNUM,
+ PAGEINFOMAP_FORKNUM,
INIT_FORKNUM
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index eb3591a..af23b26 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1363,7 +1363,7 @@ typedef struct IndexScanState
* RuntimeContext expr context for evaling runtime Skeys
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
- * VMBuffer buffer in use for visibility map testing, if any
+ * PIMBuffer buffer in use for page info map testing, if any
* HeapFetches number of tuples we were forced to fetch from heap
* ----------------
*/
@@ -1381,7 +1381,7 @@ typedef struct IndexOnlyScanState
ExprContext *ioss_RuntimeContext;
Relation ioss_RelationDesc;
IndexScanDesc ioss_ScanDesc;
- Buffer ioss_VMBuffer;
+ Buffer ioss_PIMBuffer;
long ioss_HeapFetches;
} IndexOnlyScanState;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9ecc163..614ca5a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -355,6 +355,7 @@ typedef struct PgStat_MsgVacuum
TimestampTz m_vacuumtime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgVacuum;
@@ -372,6 +373,7 @@ typedef struct PgStat_MsgAnalyze
TimestampTz m_analyzetime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgAnalyze;
@@ -551,7 +553,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9E
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -615,6 +617,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter n_dead_tuples;
PgStat_Counter changes_since_analyze;
+ int32 n_frozen_pages;
+
PgStat_Counter blocks_fetched;
PgStat_Counter blocks_hit;
@@ -917,9 +921,11 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type t
extern void pgstat_report_autovac(Oid dboid);
extern void pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_recovery_conflict(int reason);
extern void pgstat_report_deadlock(void);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..102aa81 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -367,7 +369,12 @@ typedef PageHeaderData *PageHeader;
#define PageSetAllVisible(page) \
(((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
#define PageClearAllVisible(page) \
- (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+ (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
+
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
#define PageIsPrunable(page, oldestxmin) \
( \
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..c676694 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -54,7 +54,7 @@ typedef struct SMgrRelationData
*/
BlockNumber smgr_targblock; /* current insertion target block */
BlockNumber smgr_fsm_nblocks; /* last known size of fsm fork */
- BlockNumber smgr_vm_nblocks; /* last known size of vm fork */
+ BlockNumber smgr_pim_nblocks; /* last known size of pim fork */
/* additional public fields may someday exist here */
diff --git a/src/test/regress/expected/pageinfomap.out b/src/test/regress/expected/pageinfomap.out
new file mode 100644
index 0000000..31543ba
--- /dev/null
+++ b/src/test/regress/expected/pageinfomap.out
@@ -0,0 +1,22 @@
+--
+-- Page Info Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 45 frozen pages according to page info map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+\set VERBOSITY default
+DROP TABLE vmtest;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 80374e4..b259e65 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1683,6 +1683,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(c.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid) AS last_autovacuum,
pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
@@ -1795,6 +1796,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
@@ -1838,6 +1840,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index f5be70f..dd49786 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -127,6 +127,8 @@ SELECT count(*) FROM tenk2 WHERE unique1 = 1;
1
(1 row)
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
SELECT pg_sleep(1.0);
@@ -175,6 +177,14 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
t | t
(1 row)
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+ ?column?
+----------
+ t
+(1 row)
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
snapshot_newer
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 3987b4c..c4d0281 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -110,3 +110,6 @@ test: event_trigger
# run stats by itself because its delay may be insufficient under heavy load
test: stats
+
+# page info map and vacuum test cannot run concurrently with any test that runs SQL
+test: pageinfomap
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 379f272..69fbab1 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -160,3 +160,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: pageinfomap
diff --git a/src/test/regress/sql/pageinfomap.sql b/src/test/regress/sql/pageinfomap.sql
new file mode 100644
index 0000000..739c715
--- /dev/null
+++ b/src/test/regress/sql/pageinfomap.sql
@@ -0,0 +1,16 @@
+--
+-- Page Info Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+\set VERBOSITY default
+
+DROP TABLE vmtest;
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index cd2d592..b3c640f 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -120,6 +120,8 @@ ROLLBACK;
SELECT count(*) FROM tenk2;
-- do an indexscan
SELECT count(*) FROM tenk2 WHERE unique1 = 1;
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
@@ -145,6 +147,10 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
WHERE st.relname='tenk2' AND cl.relname='tenk2';
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
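
For readers skimming the pg_upgrade part of the patch above: the rewrite_vm_table lookup table in file.c can be regenerated mechanically. Each byte of the old visibility map holds eight one-bit all-visible flags; the rewrite moves bit i to bit 2*i of a 16-bit value and leaves the interleaved all-frozen bits cleared. A minimal standalone sketch, illustration only and not part of the patch, that prints the same 256 entries:

/*
 * Illustration only: regenerate the rewrite_vm_table used by
 * rewriteVisibilitymap() in the patch above.  Old format: one
 * all-visible bit per heap block.  New format: two bits per heap
 * block, with the new all-frozen bit left cleared (0).
 */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	for (int b = 0; b < 256; b++)
	{
		uint16_t	out = 0;

		for (int i = 0; i < 8; i++)
			if (b & (1 << i))
				out |= (uint16_t) (1 << (2 * i));

		printf("%5u%s", (unsigned) out, (b % 16 == 15) ? ",\n" : ", ");
	}
	return 0;
}
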
On Tue, Nov 3, 2015 at 09:03:49AM +0530, Amit Kapila wrote:
I think in that case we can call it as page info map or page state map, but
I find retaining visibility map name in this case or for future (if we want to
add another bit) as confusing. In-fact if you find "visibility and freeze
map",
as excessively long, then we can change it to "page info map" or "page state
map" now as well.
Coming in late here, but the problem with "page info map" is that free
space is also page info (how much free space on each page), so "page
info map" isn't very descriptive. "page status" or "page state" might
make more sense, but even then, free space is a kind of page
status/state. What is happening is that broadening the name to cover
both visibility and freeze state also encompasses free space.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +
--
On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote:
On 10/01/2015 07:43 AM, Robert Haas wrote:
On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com>
wrote:
I wonder how much it's worth renaming only the file extension while
there are many places where "visibility map" and "vm" are used,
for example, log messages, function names, variables, etc.
I'd be inclined to keep calling it the visibility map (vm) even if it
also contains freeze information.
What is your main worry about changing the name of this map, is it
about more code churn or is it about that we might introduce new issues
or is it about that people are already accustomed to call this map as
visibility map?
Several:
* Visibility map is rather descriptive, none of the replacement terms
imo come close. Few people will know what a 'freeze' map is.
* It increases the size of the patch considerably
* It forces tooling that knows about the layout of the database
directory to change their tools
On the benefit side the only argument I've heard so far is that it allows
to disambiguate the format. But, uh, a look at the major version does
that just as well, for far less trouble.
It seems to me quite logical for understanding purpose as well. Any new
person who wants to work in this area or is looking into it will always
wonder why this map is named as visibility map even though it contains
information about visibility of page as well as frozen state of page.
Being frozen is about visibility as well.
Greetings,
Andres Freund
--
On Sat, Nov 14, 2015 at 1:12 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Nov 3, 2015 at 09:03:49AM +0530, Amit Kapila wrote:
I think in that case we can call it as page info map or page state map,
but
I find retaining visibility map name in this case or for future (if we
want to
add another bit) as confusing. In-fact if you find "visibility and
freeze
map",
as excessively long, then we can change it to "page info map" or "page
state
map" now as well.
Coming in late here, but the problem with "page info map" is that free
space is also page info (how much free space on each page), so "page
info map" isn't very descriptive. "page status" or "page state" might
make more sense, but even then, free space is a kind of page
status/state. What is happening is that broadening the name to cover
both visibility and freeze state also encompasses free space.
Valid point, but I think the free space map is a specific piece of page information
stored in a completely different format. A "page info"/"page state" map
could contain information about multiple states of a page in the same format.
There is yet another option of changing it to Visibility and Freeze Map and/or
changing the file extension to vfm, but Robert felt that is a rather long name
and I also agree with him.
Do you see retaining the visibility map as a better option?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund <andres@anarazel.de> wrote:
On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com>
wrote:
On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote:
On 10/01/2015 07:43 AM, Robert Haas wrote:
On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com>
wrote:
I wonder how much it's worth renaming only the file extension
while
there are many places where "visibility map" and "vm" are used,
for example, log messages, function names, variables, etc.I'd be inclined to keep calling it the visibility map (vm) even if
it
also contains freeze information.
What is your main worry about changing the name of this map, is it
about more code churn or is it about that we might introduce new issues
or is it about that people are already accustomed to call this map as
visibility map?Several:
* Visibility map is rather descriptive, none of the replacement terms
imo come close. Few people will know what a 'freeze' map is.
* It increases the size of the patch considerably
* It forces tooling that knows about the layout of the database
directory to change their tools
All these points are legitimate.
On the benfit side the only argument I've heard so far is that it allows
to disambiguate the format. But, uh, a look at the major version does
that just as well, for far less trouble.It seems to me quite logical for understanding purpose as well. Any new
person who wants to work in this area or is looking into it will always
wonder why this map is named as visibility map even though it contains
information about visibility of page as well as frozen state of page.
Being frozen is about visibility as well.
OTOH being visible doesn't mean page is frozen. I understand that frozen is
related to visibility, but still it is a separate state of page and used
for different
purpose. I think this is a subjective point and we could go either way, it
is
just a matter in which way more people are comfortable.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sun, Nov 15, 2015 at 1:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund <andres@anarazel.de> wrote:
On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com>
wrote:
On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote:
On 10/01/2015 07:43 AM, Robert Haas wrote:
On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com>
wrote:
I wonder how much it's worth renaming only the file extension
while
there are many places where "visibility map" and "vm" are used,
for example, log messages, function names, variables, etc.I'd be inclined to keep calling it the visibility map (vm) even if
it
also contains freeze information.
What is your main worry about changing the name of this map, is it
about more code churn or is it about that we might introduce new issues
or is it about that people are already accustomed to call this map as
visibility map?
Several:
* Visibility map is rather descriptive, none of the replacement terms
imo come close. Few people will know what a 'freeze' map is.
* It increases the size of the patch considerably
* It forces tooling that knows about the layout of the database
directory to change their tools
All these points are legitimate.
On the benfit side the only argument I've heard so far is that it allows
to disambiguate the format. But, uh, a look at the major version does
that just as well, for far less trouble.
It seems to me quite logical for understanding purpose as well. Any
person who wants to work in this area or is looking into it will always
wonder why this map is named as visibility map even though it contains
information about visibility of page as well as frozen state of page.
Being frozen is about visibility as well.
OTOH being visible doesn't mean page is frozen. I understand that frozen is
related to visibility, but still it is a separate state of page and used for
different
purpose. I think this is a subjective point and we could go either way, it
is
just a matter in which way more people are comfortable.
I'm stickin' with what I said before, and what I think Andres is
saying too: renaming the map is a horrible idea. It produces lots of
code churn for no real benefit. We usually avoid such changes, and I
think we should do so here, too.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
On Tue, Nov 17, 2015 at 10:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Nov 15, 2015 at 1:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund <andres@anarazel.de> wrote:
On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote:
On 10/01/2015 07:43 AM, Robert Haas wrote:
On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
I wonder how much it's worth renaming only the file extension
while
there are many places where "visibility map" and "vm" are used,
for example, log messages, function names, variables, etc.I'd be inclined to keep calling it the visibility map (vm) even
if
it
also contains freeze information.
What is your main worry about changing the name of this map, is it
about more code churn or is it about that we might introduce new
issues
or is it about that people are already accustomed to call this map as
visibility map?
Several:
* Visibility map is rather descriptive, none of the replacement terms
imo come close. Few people will know what a 'freeze' map is.
* It increases the size of the patch considerably
* It forces tooling that knows about the layout of the database
directory to change their tools
All these points are legitimate.
On the benfit side the only argument I've heard so far is that it allows
to disambiguate the format. But, uh, a look at the major version does
that just as well, for far less trouble.
It seems to me quite logical for understanding purpose as well. Any
new
person who wants to work in this area or is looking into it will
always
wonder why this map is named as visibility map even though it contains
information about visibility of page as well as frozen state of page.
Being frozen is about visibility as well.
OTOH being visible doesn't mean page is frozen. I understand that
frozen is
related to visibility, but still it is a separate state of page and used
for
different
purpose. I think this is a subjective point and we could go either way,
it
is
just a matter in which way more people are comfortable.
I'm stickin' with what I said before, and what I think Andres is
saying too: renaming the map is a horrible idea. It produces lots of
code churn for no real benefit. We usually avoid such changes, and I
think we should do so here, too.
I understood.
I'm going to turn the patch back to visibility map, and just add the logic
of enhancement of VACUUM FREEZE.
If we want to add the other status not related to visibility into
visibility map in the future, it would be worth to consider.
Regards,
--
Masahiko Sawada
On 17 November 2015 at 10:29, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Nov 17, 2015 at 10:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Nov 15, 2015 at 1:47 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund <andres@anarazel.de>
wrote:
On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com>
wrote:
On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote:
On 10/01/2015 07:43 AM, Robert Haas wrote:
On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao
<masao.fujii@gmail.com> wrote:
I wonder how much it's worth renaming only the file extension
while
there are many places where "visibility map" and "vm" are used,
for example, log messages, function names, variables, etc.I'd be inclined to keep calling it the visibility map (vm) even
if
it
also contains freeze information.
What is your main worry about changing the name of this map, is it
about more code churn or is it about that we might introduce new
issues
or is it about that people are already accustomed to call this map as
visibility map?
Several:
* Visibility map is rather descriptive, none of the replacement terms
imo come close. Few people will know what a 'freeze' map is.
* It increases the size of the patch considerably
* It forces tooling that knows about the layout of the database
directory to change their tools
All these points are legitimate.
On the benfit side the only argument I've heard so far is that it allows
to disambiguate the format. But, uh, a look at the major version does
that just as well, for far less trouble.
It seems to me quite logical for understanding purpose as well. Any
new
person who wants to work in this area or is looking into it will
always
wonder why this map is named as visibility map even though it contains
information about visibility of page as well as frozen state of page.
Being frozen is about visibility as well.
OTOH being visible doesn't mean page is frozen. I understand that frozen
is
related to visibility, but still it is a separate state of page and used
for
different
purpose. I think this is a subjective point and we could go either way,
it
is
just a matter in which way more people are comfortable.I'm stickin' with what I said before, and what I think Andres is
saying too: renaming the map is a horrible idea. It produces lots of
code churn for no real benefit. We usually avoid such changes, and I
think we should do so here, too.
I understood.
I'm going to turn the patch back to visibility map, and just add the logic
of enhancement of VACUUM FREEZE.
If we want to add the other status not related to visibility into visibility
map in the future, it would be worth to consider.
Could someone post a TL;DR summary of what the current plan looks
like? I can see there is a huge amount of discussion to trawl back
through. I can see it's something to do with the visibility map. And
does it avoid freezing very large tables like the title originally
sought?
Thanks
Thom
--
On 11/17/15 4:41 AM, Thom Brown wrote:
Could someone post a TL;DR summary of what the current plan looks
like? I can see there is a huge amount of discussion to trawl back
through. I can see it's something to do with the visibility map. And
does it avoid freezing very large tables like the title originally
sought?
Basically, it follows the same pattern that all-visible bits do, except
instead of indicating a page is all-visible, the bit shows that all
tuples on the page are frozen. That allows a scan_all vacuum to skip
those pages.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
--
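
To make that skipping concrete, here is a toy standalone model of the idea described above. It is illustration only, with made-up data; in the actual patch the flag comes from the all-frozen bit in the map fork (the PIM_ALL_FROZEN test), not from an in-memory array.

/*
 * Toy model of a scan_all vacuum that skips pages whose map bit says
 * every tuple is already frozen.  Not backend code.
 */
#include <stdio.h>
#include <stdbool.h>

#define NBLOCKS 8

int
main(void)
{
	/* pretend the map reports blocks 0-5 as all-frozen */
	bool		all_frozen[NBLOCKS] = {true, true, true, true, true, true, false, false};
	int			scanned = 0;
	int			skipped = 0;

	for (int blkno = 0; blkno < NBLOCKS; blkno++)
	{
		if (all_frozen[blkno])
		{
			skipped++;			/* nothing on this page can need freezing */
			continue;
		}
		scanned++;				/* would read the heap page and freeze tuples */
	}

	printf("scanned %d pages, skipped %d all-frozen pages\n", scanned, skipped);
	return 0;
}
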
On 17 November 2015 at 15:43, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 11/17/15 4:41 AM, Thom Brown wrote:
Could someone post a TL;DR summary of what the current plan looks
like? I can see there is a huge amount of discussion to trawl back
through. I can see it's something to do with the visibility map. And
does it avoid freezing very large tables like the title originally
sought?
Basically, it follows the same pattern that all-visible bits do, except
instead of indicating a page is all-visible, the bit shows that all tuples
on the page are frozen. That allows a scan_all vacuum to skip those pages.
So the visibility map is being repurposed? And if a row on a frozen
page is modified, what happens to the visibility of all other rows on
that page, as the bit will be set back to 0? I think I'm missing a
critical part of this functionality.
Thom
--
On Wed, Nov 18, 2015 at 12:56 AM, Thom Brown <thom@linux.com> wrote:
On 17 November 2015 at 15:43, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 11/17/15 4:41 AM, Thom Brown wrote:
Could someone post a TL;DR summary of what the current plan looks
like? I can see there is a huge amount of discussion to trawl back
through. I can see it's something to do with the visibility map. And
does it avoid freezing very large tables like the title originally
sought?
Basically, it follows the same pattern that all-visible bits do, except
instead of indicating a page is all-visible, the bit shows that all tuples
on the page are frozen. That allows a scan_all vacuum to skip those pages.
So the visibility map is being repurposed?
My proposal is to add one additional bit to the visibility map, indicating that
all tuples on the page are completely frozen.
That is, the visibility map will become a bitmap with two bits
(all-visible, all-frozen) per page.
And if a row on a frozen
page is modified, what happens to the visibility of all other rows on
that page, as the bit will be set back to 0?
In this case, both of the corresponding VM bits are cleared.
Such behaviour is almost the same as what PostgreSQL is doing today.
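
For illustration only (this is not code from the attached patch), a toy
standalone sketch of what two bits per heap page means for the map layout
and for clearing on modification. The 4-blocks-per-byte arithmetic mirrors
the patch's BITS_PER_HEAPBLOCK / HEAPBLOCKS_PER_BYTE / HEAPBLK_TO_MAPBIT
macros; the EX_* names and bit values are made up for the example.

#include <stdint.h>
#include <stdio.h>

#define BITS_PER_HEAPBLOCK  2          /* all-visible bit + all-frozen bit */
#define HEAPBLOCKS_PER_BYTE 4          /* was 8 with one bit per heap block */
#define EX_ALL_VISIBLE 0x01            /* illustrative values, not from the patch */
#define EX_ALL_FROZEN  0x02
#define EX_VALID_BITS  (EX_ALL_VISIBLE | EX_ALL_FROZEN)

int
main(void)
{
    uint8_t  map[8] = {0};             /* a few bytes of a toy map page */
    uint32_t blkno = 10;               /* some heap block number */
    uint32_t mapByte = blkno / HEAPBLOCKS_PER_BYTE;
    uint32_t mapBit  = (blkno % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK;

    /* VACUUM sets all-visible, and also all-frozen once every tuple is frozen */
    map[mapByte] |= (uint8_t) (EX_VALID_BITS << mapBit);

    /* A later INSERT/UPDATE/DELETE on that page clears both bits together */
    map[mapByte] &= (uint8_t) ~(EX_VALID_BITS << mapBit);

    printf("map byte %u is now 0x%02x\n", (unsigned) mapByte, (unsigned) map[mapByte]);
    return 0;
}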
Regards,
--
Masahiko Sawada
On Tue, Nov 17, 2015 at 7:29 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Nov 17, 2015 at 10:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Nov 15, 2015 at 1:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund <andres@anarazel.de> wrote:
On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 1 October 2015 at 23:30, Josh Berkus <josh@agliodbs.com> wrote:
On 10/01/2015 07:43 AM, Robert Haas wrote:
On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
I wonder how much it's worth renaming only the file extension while
there are many places where "visibility map" and "vm" are used,
for example, log messages, function names, variables, etc.

I'd be inclined to keep calling it the visibility map (vm) even if it
also contains freeze information.

What is your main worry about changing the name of this map, is it
about more code churn or is it about that we might introduce new issues
or is it about that people are already accustomed to call this map as
visibility map?

Several:
* Visibility map is rather descriptive, none of the replacement terms
imo come close. Few people will know what a 'freeze' map is.
* It increases the size of the patch considerably
* It forces tooling that knows about the layout of the database
directory to change their tools

All these points are legitimate.

On the benefit side the only argument I've heard so far is that it allows
to disambiguate the format. But, uh, a look at the major version does
that just as well, for far less trouble.

It seems to me quite logical for understanding purpose as well. Any new
person who wants to work in this area or is looking into it will always
wonder why this map is named as visibility map even though it contains
information about visibility of page as well as frozen state of page.

Being frozen is about visibility as well.
OTOH being visible doesn't mean page is frozen. I understand that frozen
is related to visibility, but still it is a separate state of page and
used for different purpose. I think this is a subjective point and we
could go either way, it is just a matter in which way more people are
comfortable.

I'm stickin' with what I said before, and what I think Andres is
saying too: renaming the map is a horrible idea. It produces lots of
code churn for no real benefit. We usually avoid such changes, and I
think we should do so here, too.

I understood.
I'm going to turn the patch back to the visibility map, and just add the
logic for enhancing VACUUM FREEZE.

Attached latest v24 patch.
I've changed the patch so that it just adds a frozen bit into the
visibility map, so the size of the patch is almost half of the previous one.
Please review it.
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v24.patch (text/x-patch; charset=US-ASCII)
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..e8ebfe9 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6e14851..c75a166 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5905,7 +5905,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -5949,7 +5949,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..5a43c28 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and which pages contain only frozen tuples.
+ This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -438,28 +438,25 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows. To ensure all old
+ row versions have been frozen, a scan of all unfrozen pages is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: whole-table freezing is forced if
+ the table hasn't had all row versions frozen for
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transactions.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
- that, data loss could result. To ensure that this does not happen,
- autovacuum is invoked on any table that might contain unfrozen rows with
- XIDs older than the age specified by the configuration parameter <xref
- linkend="guc-autovacuum-freeze-max-age">. (This will happen even if
- autovacuum is disabled.)
+ the time <command>VACUUM</> last scanned unfrozen pages. If it were to go
+ unvacuumed for longer than that, data loss could result. To ensure
+ that this does not happen, autovacuum is invoked on any table that might
+ contain unfrozen rows with XIDs older than the age specified by the
+ configuration parameter <xref linkend="guc-autovacuum-freeze-max-age">.
+ (This will happen even if autovacuum is disabled.)
</para>
<para>
@@ -490,8 +487,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +523,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +551,19 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. Whole-table freezing occurs only when all pages happen to
+ require freezing. In other cases, such as where
+ <structfield>relfrozenxid</> is more than <varname>vacuum_freeze_table_age</>
+ transactions old or where <command>VACUUM</>'s <literal>FREEZE</> option is used,
+ <command>VACUUM</> can skip the pages whose tuples are all
+ marked as frozen.
+ When all pages of the table are eventually marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transactions started since the <command>VACUUM</> started).
+ If <structfield>relfrozenxid</> has not been advanced by the time
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,28 +640,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ When <command>VACUUM</> scans all unfrozen pages, regardless of what causes
+ it to do so, it can advance the value for that table.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Such
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +741,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e64b7ef..1908a4d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1332,6 +1332,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Estimated number of rows modified since this table was last analyzed</entry>
</row>
<row>
+ <entry><structfield>n_frozen_page</></entry>
+ <entry><type>integer</></entry>
+ <entry>Number of frozen pages</entry>
+ </row>
+ <row>
<entry><structfield>last_vacuum</></entry>
<entry><type>timestamp with time zone</></entry>
<entry>Last time at which this table was manually vacuumed
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index eb113c2..d613bb7 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -657,6 +657,12 @@ psql --username postgres --file script.sql postgres
</para>
<para>
+ Since the format of the visibility map has been changed in version 9.6,
+ <application>pg_upgrade</> does not support upgrading a database
+ from 9.5 or earlier to 9.6 or later with link mode (-k).
+ </para>
+
+ <para>
All failure, rebuild, and reindex cases will be reported by
<application>pg_upgrade</> if they affect your installation;
post-upgrade scripts to rebuild tables and indexes will be
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..5dc8b04 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -623,18 +623,20 @@ can be used to examine the information stored in free space maps.
<para>
Each heap relation has a Visibility Map
(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
+visible to all active transactions, and which pages contain only frozen tuples.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_vm</> suffix.
+For example, if the filenode of a relation is 12345, the VM is stored in a file
+called <filename>12345_vm</>, in the same directory as the main relation file.
Note that indexes do not have VMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The visibility map stores two bits per heap page: all-visible and all-frozen.
+A set all-visible bit means that all tuples on the page are known to be visible
+to all transactions. A set all-frozen bit means that all tuples on the page are
+completely frozen. This means that the page does not contain any tuples
+that need to be vacuumed, even if a scan of the whole table is required.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 35a2b05..60eb41f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3034,9 +3034,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -7205,7 +7205,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -7215,6 +7215,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7804,7 +7805,12 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7856,7 +7862,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6db73bf..0fe49eb 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -119,7 +119,7 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock,
* be less than buffer2.
*/
static void
-GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
+GetVisibilitymapPins(Relation relation, Buffer buffer1, Buffer buffer2,
BlockNumber block1, BlockNumber block2,
Buffer *vmbuffer1, Buffer *vmbuffer2)
{
@@ -380,11 +380,11 @@ RelationGetBufferForTuple(Relation relation, Size len,
* done.
*/
if (otherBuffer == InvalidBuffer || buffer <= otherBuffer)
- GetVisibilityMapPins(relation, buffer, otherBuffer,
+ GetVisibilitymapPins(relation, buffer, otherBuffer,
targetBlock, otherBlock, vmbuffer,
vmbuffer_other);
else
- GetVisibilityMapPins(relation, otherBuffer, buffer,
+ GetVisibilitymapPins(relation, otherBuffer, buffer,
otherBlock, targetBlock, vmbuffer_other,
vmbuffer);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..387a0d6 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -15,39 +15,46 @@
* visibilitymap_pin - pin a map page for setting a bit
* visibilitymap_pin_ok - check whether correct map page is already pinned
* visibilitymap_set - set a bit in a previously pinned page
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
* visibilitymap_count - count number of bits set in visibility map
* visibilitymap_truncate - truncate the visibility map
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if a whole-table-scanning vacuum is required (e.g. anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. The all-frozen bit must be
+ * cleared together with the all-visible bit.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is visible or
+ * frozen; we just don't know that for certain. The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility map
+ * bit. If a crash occurs after the visibility map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
+ * The visibility map also has an all-frozen bit, which indicates that all tuples
+ * on the corresponding page have been completely frozen, so the visibility map
+ * is also used by anti-wraparound vacuums, even though they need to freeze tuples.
*
* LOCKING
*
@@ -58,14 +65,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +108,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+*/
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -113,26 +123,44 @@
/* Mapping from heap block number to the right bit in the visibility map */
#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
-#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
+
+/* tables for fast counting of set bits for visible and freeze */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +169,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,11 +181,11 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = VISIBILITYMAP_VALID_BITS << mapBit;
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s block %d", RelationGetRelationName(rel), heapBlk);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -186,7 +214,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
* visibilitymap_set to actually set the bit.
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk.
*
@@ -212,7 +240,7 @@ visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
* visibilitymap_pin_ok - do we already have the correct page pinned?
*
* On entry, buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. The return value indicates whether the buffer covers the
* given heapBlk.
*/
@@ -225,7 +253,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +262,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and flags which indicates what flag we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +274,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,11 +284,12 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
Assert(InRecovery || BufferIsValid(heapBuf));
+ Assert(flags & VISIBILITYMAP_VALID_BITS);
/* Check that we have the right heap page pinned, if present */
if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -272,11 +303,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if ((map[mapByte] & (flags << mapBit)) != (flags << mapBit))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << mapBit);
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +316,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +326,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(((flags & VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
+ ((flags & VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +346,17 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible to all or are marked frozen, according
+ * to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
- * earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must pass flags
+ * for which it needs to check the value in visibility map.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -327,17 +365,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* we might see the old value. It is the caller's responsibility to deal with
* all concurrency issues!
*/
-bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+uint8
+visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- bool result;
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_get_status %s, block %d", RelationGetRelationName(rel), heapBlk);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,13 +397,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A read of both bits is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
-
- return result;
+ return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
}
/*
@@ -374,13 +409,17 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * The caller must set the flags which indicates what flag we want to count.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, BlockNumber *all_frozen)
{
- BlockNumber result = 0;
BlockNumber mapBlock;
+ BlockNumber all_visible = 0;
+
+ if (all_frozen)
+ *all_frozen = 0;
for (mapBlock = 0;; mapBlock++)
{
@@ -406,13 +445,15 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ all_visible += number_of_ones_for_visible[map[i]];
+ if (all_frozen)
+ *all_frozen += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
}
- return result;
+ return all_visible;
}
/*
@@ -435,7 +476,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ elog(DEBUG1, "vm_truncate %s block %d", RelationGetRelationName(rel), nheapblocks);
#endif
RelationOpenSmgr(rel);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..1cea026 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1813,7 +1813,7 @@ FormIndexDatum(IndexInfo *indexInfo,
* isprimary: if true, set relhaspkey true; else no change
* reltuples: if >= 0, set reltuples to this value; else no change
*
- * If reltuples >= 0, relpages and relallvisible are also updated (using
+ * If reltuples >= 0, relpages, relallvisible are also updated (using
* RelationGetNumberOfBlocks() and visibilitymap_count()).
*
* NOTE: an important side-effect of this operation is that an SI invalidation
@@ -1921,7 +1921,7 @@ index_update_stats(Relation rel,
BlockNumber relallvisible;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ relallvisible = visibilitymap_count(rel, NULL);
else /* don't bother for indexes */
relallvisible = 0;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ccc030f..8c555eb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -444,6 +444,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(C.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ddb68ab..02a2c68 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -314,6 +314,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ BlockNumber relallvisible,
+ relallfrozen;
if (inh)
ereport(elevel,
@@ -564,6 +566,10 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
}
}
+ /* Calculate the number of all-visible and all-frozen bit */
+ if (!inh)
+ relallvisible = visibilitymap_count(onerel, &relallfrozen);
+
/*
* Update pages/tuples stats in pg_class ... but not if we're doing
* inherited stats.
@@ -572,7 +578,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ relallvisible,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -608,7 +614,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
* tracks per-table stats.
*/
if (!inh)
- pgstat_report_analyze(onerel, totalrows, totaldeadrows);
+ pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
if (!(options & VACOPT_VACUUM))
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7c4ef58..0a02a25 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -729,11 +729,11 @@ vac_estimate_reltuples(Relation relation, bool is_analyze,
* marked with xmin = our xid.
*
* In addition to fundamentally nontransactional statistics such as
- * relpages and relallvisible, we try to maintain certain lazily-updated
- * DDL flags such as relhasindex, by clearing them if no longer correct.
- * It's safe to do this in VACUUM, which can't run in parallel with
- * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
- * However, it's *not* safe to do it in an ANALYZE that's within an
+ * relpages and relallvisible, we try to maintain certain lazily-updated
+ * DDL flags such as relhasindex, by clearing them if no longer correct.
+ * It's safe to do this in VACUUM, which can't run in
+ * parallel with CREATE INDEX/RULE/TRIGGER and can't be part of a transaction
+ * block. However, it's *not* safe to do it in an ANALYZE that's within an
* outer transaction, because for example the current transaction might
* have dropped the last index; then we'd think relhasindex should be
* cleared, but if the transaction later rolls back this would be wrong.
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2429889..994efb7 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped by all-frozen bit
+ of visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -221,7 +224,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* We request a full scan if either the table's frozen Xid is now older
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit.
+ * mxid full-table scan limit. During a full scan, we can skip some pages
+ * according to the all-frozen bit of the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +257,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -301,10 +306,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = visibilitymap_count(onerel, &new_rel_allfrozen);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -325,7 +333,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),
onerel->rd_rel->relisshared,
new_live_tuples,
- vacrelstats->new_dead_tuples);
+ vacrelstats->new_dead_tuples,
+ new_rel_allfrozen);
/* and log the action if appropriate */
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
@@ -360,10 +369,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +496,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * the visibility map means that we can't update relfrozenxid, so we only want
+ * to do it if we can skip a goodly number. On the other hand, we count
+ * both how many pages we skipped according to the all-frozen bit of the
+ * visibility map and how many pages we freeze, so we can update relfrozenxid
+ * if the sum of the two equals the number of pages in the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -498,24 +511,24 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we might not actually skip any pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
- * Note: The value returned by visibilitymap_test could be slightly
+ * Note: The value returned by visibilitymap_get_status could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*/
for (next_not_all_visible_block = 0;
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -533,9 +546,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples frozen by this vacuum */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -547,8 +564,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -562,14 +578,29 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
skipping_all_visible_blocks = true;
else
skipping_all_visible_blocks = false;
+
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We also check whether this block is all-frozen, so that we can skip
+ * vacuuming it even if scan_all is true.
+ */
+ bool all_frozen = VM_ALL_FROZEN(onerel, blkno, &vmbuffer);
+
+ if (scan_all && all_frozen)
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ else if (!scan_all && skipping_all_visible_blocks)
+ continue;
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -716,7 +747,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
empty_pages++;
freespace = PageGetHeapFreeSpace(page);
- /* empty pages are always all-visible */
+ /* empty pages are always all-visible and all-frozen */
if (!PageIsAllVisible(page))
{
START_CRIT_SECTION();
@@ -739,8 +770,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
log_newpage_buffer(buf, true);
PageSetAllVisible(page);
+ PageSetAllFrozen(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
END_CRIT_SECTION();
}
@@ -764,6 +797,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +953,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +971,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we freeze any tuples, mark the buffer dirty, and write a WAL
+ * record recording the changes. We must log the changes to be crash-safe
+ * against future truncation of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1006,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1031,46 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,9 +1081,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
{
- elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+ /* If the all-frozen is set then all-visible must be set */
+ if (all_frozen_according_to_vm)
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page is not marked all-visible (and all-frozen) but visibility map bit(s) is set in relation \"%s\" page %u",
relname, blkno);
visibilitymap_clear(onerel, blkno, vmbuffer);
}
@@ -1028,19 +1096,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/*
* It's possible for the value returned by GetOldestXmin() to move
* backwards, so it's not wrong for us to see tuples that appear to
- * not be visible to everyone yet, while PD_ALL_VISIBLE is already
- * set. The real safe xmin value never moves backwards, but
- * GetOldestXmin() is conservative and sometimes returns a value
+ * not be visible to everyone yet, while PD_ALL_VISIBLE (and PD_ALL_FROZEN)
+ * are already set. The real safe xmin value never moves backwards,
+ * but GetOldestXmin() is conservative and sometimes returns a value
* that's unnecessarily small, so if we see that contradiction it just
* means that the tuples that we think are not visible to everyone yet
- * actually are, and the PD_ALL_VISIBLE flag is correct.
+ * actually are, and the PD_ALL_VISIBLE (and PD_ALL_FROZEN) flags are
+ * correct.
*
* There should never be dead tuples on a page with PD_ALL_VISIBLE
* set, however.
*/
else if (PageIsAllVisible(page) && has_dead_tuples)
{
- elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+ /* If the all-frozen is set then all-visible must be set */
+ if (PageIsAllFrozen(page))
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page containing dead tuples is marked as all-visible (and all-frozen) in relation \"%s\" page %u",
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
@@ -1114,6 +1188,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to visibility map",
+ "skipped %d frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages,
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1230,6 +1311,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1281,19 +1363,34 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set the VM all-frozen bit and flag.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
+ uint8 vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer);
+ uint8 flags = 0;
+
+ if (!(vm_status & VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* Set the VM all-frozen bit in the flags, if needed */
+ if (all_frozen && !(vm_status & VISIBILITYMAP_ALL_FROZEN))
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+
+ if (vm_status != flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1783,10 +1880,12 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and all_frozen which implies that all tuples
+ * of this page are frozen.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1795,6 +1894,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1818,11 +1918,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1859,6 +1960,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1867,6 +1972,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
@@ -1875,5 +1981,8 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
} /* scan along page */
+ if (!all_visible)
+ *all_frozen = false;
+
return all_visible;
}
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..e345177 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -85,7 +85,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
* which all tuples are known visible to everybody. In any case,
* we'll use the index tuple not the heap tuple as the data source.
*
- * Note on Memory Ordering Effects: visibilitymap_test does not lock
+ * Note on Memory Ordering Effects: visibilitymap_get_status does not lock
* the visibility map buffer, and therefore the result we read here
* could be slightly stale. However, it can't be stale enough to
* matter.
@@ -114,9 +114,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
* It's worth going through this complexity to avoid needing to lock
* the VM buffer, which could cause significant contention.
*/
- if (!visibilitymap_test(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ if (!VM_ALL_VISIBLE(scandesc->heapRelation, ItemPointerGetBlockNumber(tid),
+ &node->ioss_VMBuffer))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 990486c..d100a7d 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -468,7 +468,7 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
* estimates based on the correlation squared (XXX is that appropriate?).
*
* If it's an index-only scan, then we will not need to fetch any heap
- * pages for which the visibility map shows all tuples are visible.
+ * pages for which the visibility map shows all tuples are visible.
* Hence, reduce the estimated number of heap fetches accordingly.
* We use the measured fraction of the entire heap that is all-visible,
* which might not be particularly relevant to the subset of the heap
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ab018c4..ca7257a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1329,7 +1329,8 @@ pgstat_report_autovac(Oid dboid)
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgVacuum msg;
@@ -1343,6 +1344,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -1354,7 +1356,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
*/
void
pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgAnalyze msg;
@@ -1394,6 +1397,7 @@ pgstat_report_analyze(Relation rel,
msg.m_analyzetime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -3702,6 +3706,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
+ result->n_frozen_pages = 0;
result->blocks_fetched = 0;
result->blocks_hit = 0;
result->vacuum_timestamp = 0;
@@ -5069,6 +5074,7 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
if (msg->m_autovacuum)
{
@@ -5103,6 +5109,7 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
/*
* We reset changes_since_analyze to zero, forgetting any changes that
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f7c9bf6..98c14f9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -46,6 +46,7 @@ extern Datum pg_stat_get_vacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autovacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_analyze_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_frozen_pages(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_calls(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_total_time(PG_FUNCTION_ARGS);
@@ -450,6 +451,21 @@ pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_frozen_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int32 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int32) (tabentry->n_frozen_pages);
+
+ PG_RETURN_INT32(result);
+}
+
+Datum
pg_stat_get_function_calls(PG_FUNCTION_ARGS)
{
Oid funcid = PG_GETARG_OID(0);
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 41d4606..3a666f8 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -231,6 +231,15 @@ check_cluster_versions(void)
if (old_cluster.major_version > new_cluster.major_version)
pg_fatal("This utility cannot be used to downgrade to older major PostgreSQL versions.\n");
+ /*
+ * We can't allow upgrading with link mode from 9.5 or before to 9.6 or later,
+ * because the format of the visibility map has been changed in version 9.6.
+ */
+ if (user_opts.transfer_mode == TRANSFER_MODE_LINK &&
+ GET_MAJOR_VERSION(old_cluster.major_version) < 906 &&
+ GET_MAJOR_VERSION(new_cluster.major_version) >= 906)
+ pg_fatal("This utility cannot upgrade from PostgreSQL version from 9.5 or before to 9.6 or later with link mode.\n");
+
/* get old and new binary versions */
get_bin_version(&old_cluster);
get_bin_version(&new_cluster);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 37eb832..d448a55 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
#include <fcntl.h>
@@ -21,6 +22,43 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static const char *rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool force);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
+
+/*
+ * copyOrRewriteFile()
+ * This function copies a file or rewrites a visibility map file.
+ * If rewrite_vm is true, we have to rewrite the visibility map regardless of the value of pageConverter.
+ */
+const char *
+copyOrRewriteFile(pageCnvCtx *pageConverter,
+ const char *src, const char *dst, bool force, bool rewrite_vm)
+{
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, force);
+ else
+ return copyAndUpdateFile(pageConverter, src, dst, force);
+}
/*
* copyAndUpdateFile()
@@ -205,6 +243,97 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilitymap()
+ *
+ * An additional bit indicating that all tuples on the page are completely
+ * frozen has been added to the visibility map, so the format of the
+ * visibility map has changed.
+ * Copies a visibility map file while adding an all-frozen bit (0) for each existing bit.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+#define REWRITE_BUF_SIZE (50 * BLCKSZ)
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd = 0;
+ int dst_fd = 0;
+ uint16 vm_bits;
+ ssize_t nbytes;
+ char *buffer = NULL;
+ int ret = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return getErrorText(EINVAL);
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
+
+ /* Copy page header data in advance */
+ if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
+ goto err;
+
+ if (write(dst_fd, buffer, nbytes) != nbytes)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ /* perform data rewriting, i.e. read from the source, write to the destination */
+ while (true)
+ {
+ ssize_t nbytes = read(src_fd, buffer, REWRITE_BUF_SIZE);
+ char *cur, *end;
+
+ if (nbytes < 0)
+ {
+ ret = -1;
+ break;
+ }
+
+ if (nbytes == 0)
+ break;
+
+ cur = buffer;
+ end = buffer + nbytes;
+
+ /* Rewrite one source byte at a time, writing BITS_PER_HEAPBLOCK bytes to dst_fd */
+ while (end > cur)
+ {
+ /* Get the rewritten bits for this byte from the table */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+
+ if (write(dst_fd, &vm_bits, BITS_PER_HEAPBLOCK) != BITS_PER_HEAPBLOCK)
+ {
+ ret = -1;
+ break;
+ }
+ cur++;
+ }
+ }
+
+err:
+
+ if (buffer)
+ pg_free(buffer);
+
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText(errno);
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index fa4661b..3147480 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -110,6 +110,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map was changed by this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201511181
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -394,6 +398,8 @@ const pageCnvCtx *setupPageConverter(void);
typedef void *pageCnvCtx;
#endif
+const char *copyOrRewriteFile(pageCnvCtx *pageConverter, const char *src,
+ const char *dst, bool force, bool rewrite_vm);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c22df42..76418bd 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *suffix, bool vm_need_rewrite);
/*
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_need_rewrite = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,13 +181,20 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_need_rewrite = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(pageConverter, &maps[mapnum], "", false);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +202,14 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(pageConverter, &maps[mapnum], "_fsm", false);
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ {
+ if (vm_need_rewrite)
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", true);
+ else
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", false);
+ }
}
}
}
@@ -210,7 +223,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
*/
static void
transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+ const char *type_suffix, bool vm_need_rewrite)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -276,7 +289,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ if ((msg = copyOrRewriteFile(pageConverter, old_file, new_file, true, vm_need_rewrite)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index aa7f399..6b058d4 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..c55d232 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer pim_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index b3b91e7..a200e5e 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -40,6 +40,6 @@ extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other);
+ Buffer *pimbuffer, Buffer *pimbuffer_other);
#endif /* HIO_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..5f032ab 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,28 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
+#define VISIBILITYMAP_VALID_BITS 0x03 /* OR of all valid visiblitymap flags bits */
+
+/* Macros for visibilitymap test */
+#define VM_ALL_VISIBLE(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_VISIBLE) != 0)
+#define VM_ALL_FROZEN(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern BlockNumber visibilitymap_count(Relation rel, BlockNumber *all_frozen);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index eba4150..f6ae108 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201511071
+#define CATALOG_VERSION_NO 201511181
#endif
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index d8640db..9a77d7d 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2779,6 +2779,8 @@ DATA(insert OID = 3056 ( pg_stat_get_analyze_count PGNSP PGUID 12 1 0 0 0 f f f
DESCR("statistics: number of manual analyzes for a table");
DATA(insert OID = 3057 ( pg_stat_get_autoanalyze_count PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_autoanalyze_count _null_ _null_ _null_ ));
DESCR("statistics: number of auto analyzes for a table");
+DATA(insert OID = 6015 ( pg_stat_get_frozen_pages PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_frozen_pages _null_ _null_ _null_ ));
+DESCR("statistics: number of frozen pages of table");
DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 0 f f f f t t s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_idset _null_ _null_ _null_ ));
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f f f t s r 1 0 2249 "23" "{23,26,23,26,25,25,25,16,1184,1184,1184,1184,869,25,23,28,28,16,25,25,23,16,25}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,pid,usesysid,application_name,state,query,waiting,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,ssl,sslversion,sslcipher,sslbits,sslcompression,sslclientdn}" _null_ _null_ pg_stat_get_activity _null_ _null_ _null_ ));
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index eb3591a..6165500 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1363,7 +1363,7 @@ typedef struct IndexScanState
* RuntimeContext expr context for evaling runtime Skeys
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
- * VMBuffer buffer in use for visibility map testing, if any
+ * PIMBuffer buffer in use for visibility map testing, if any
* HeapFetches number of tuples we were forced to fetch from heap
* ----------------
*/
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9ecc163..614ca5a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -355,6 +355,7 @@ typedef struct PgStat_MsgVacuum
TimestampTz m_vacuumtime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgVacuum;
@@ -372,6 +373,7 @@ typedef struct PgStat_MsgAnalyze
TimestampTz m_analyzetime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgAnalyze;
@@ -551,7 +553,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9E
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -615,6 +617,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter n_dead_tuples;
PgStat_Counter changes_since_analyze;
+ int32 n_frozen_pages;
+
PgStat_Counter blocks_fetched;
PgStat_Counter blocks_hit;
@@ -917,9 +921,11 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type t
extern void pgstat_report_autovac(Oid dboid);
extern void pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_recovery_conflict(int reason);
extern void pgstat_report_deadlock(void);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..102aa81 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -367,7 +369,12 @@ typedef PageHeaderData *PageHeader;
#define PageSetAllVisible(page) \
(((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
#define PageClearAllVisible(page) \
- (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+ (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
+
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
#define PageIsPrunable(page, oldestxmin) \
( \
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 80374e4..b259e65 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1683,6 +1683,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(c.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid) AS last_autovacuum,
pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
@@ -1795,6 +1796,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
@@ -1838,6 +1840,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index f5be70f..dd49786 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -127,6 +127,8 @@ SELECT count(*) FROM tenk2 WHERE unique1 = 1;
1
(1 row)
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
SELECT pg_sleep(1.0);
@@ -175,6 +177,14 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
t | t
(1 row)
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+ ?column?
+----------
+ t
+(1 row)
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
snapshot_newer
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..3be0354
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,22 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 45 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+\set VERBOSITY default
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 3987b4c..5253a29 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -110,3 +110,6 @@ test: event_trigger
# run stats by itself because its delay may be insufficient under heavy load
test: stats
+
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 379f272..c5fd695 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -160,3 +160,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index cd2d592..b3c640f 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -120,6 +120,8 @@ ROLLBACK;
SELECT count(*) FROM tenk2;
-- do an indexscan
SELECT count(*) FROM tenk2 WHERE unique1 = 1;
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
@@ -145,6 +147,10 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
WHERE st.relname='tenk2' AND cl.relname='tenk2';
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..365570b
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,16 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+\set VERBOSITY default
+
+DROP TABLE vmtest;
On Tue, Nov 17, 2015 at 10:32 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached is the latest v24 patch.
I've changed the patch so that it just adds the frozen bit to the visibility map.
So the size of the patch is almost half of the previous one.
Should there be an Assert in visibilitymap_get_status (or elsewhere)
against the impossible state of being all frozen but not all visible?
I get an error when running pg_upgrade from 9.4 to 9.6:
error while copying relation "mediawiki.archive"
("/tmp/data/base/16414/21043_vm" to
"/tmp/data_fm/base/16400/21043_vm"): No such file or directory
Cheers,
Jeff
On Wed, Nov 18, 2015 at 11:18 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
I get an error when running pg_upgrade from 9.4 to 9.6:
error while copying relation "mediawiki.archive"
("/tmp/data/base/16414/21043_vm" to
"/tmp/data_fm/base/16400/21043_vm"): No such file or directory
OK, so the problem seems to be that rewriteVisibilitymap can get
called with errno already set to a nonzero value.
It never clears it, and then fails at the end despite that no error
has actually occurred.
Just setting it to 0 at the top of the function seems to be the correct
thing to do. Or does it need to save the old value and restore it?
But now when I want to do the upgrade faster, I run into this:
"This utility cannot upgrade from PostgreSQL version from 9.5 or
before to 9.6 or later with link mode."
Is this really an acceptable tradeoff? Surely we can arrange to
link everything else and rewrite just the _vm, which is a tiny portion
of the data directory. I don't think that -k promises to link
everything it possibly can.
Cheers,
Jeff
On Thu, Nov 19, 2015 at 5:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Nov 18, 2015 at 11:18 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
I get an error when running pg_upgrade from 9.4 to 9.6:
error while copying relation "mediawiki.archive"
("/tmp/data/base/16414/21043_vm" to
"/tmp/data_fm/base/16400/21043_vm"): No such file or directory
OK, so the problem seems to be that rewriteVisibilitymap can get
called with errno already set to a nonzero value.
It never clears it, and then fails at the end despite that no error
has actually occurred.
Just setting it to 0 at the top of the function seems to be the correct
thing to do. Or does it need to save the old value and restore it?
Thank you for testing!
I think that the former is better, so I've attached the latest patch.
But now when I want to do the upgrade faster, I run into this:
"This utility cannot upgrade from PostgreSQL version from 9.5 or
before to 9.6 or later with link mode."
Is this really an acceptable tradeoff? Surely we can arrange to
link everything else and rewrite just the _vm, which is a tiny portion
of the data directory. I don't think that -k promises to link
everything it possibly can.
I agree.
I've changed the patch so that
pg_upgrade creates a new _vm file and rewrites it even when upgrading to
9.6 with link mode.
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v25.patchapplication/octet-stream; name=000_add_frozen_bit_into_visibilitymap_v25.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..e8ebfe9 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6e14851..c75a166 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5905,7 +5905,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -5949,7 +5949,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..5a43c28 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and which pages contain only frozen tuples.
+ This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -438,28 +438,25 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows. To ensure all old
+ row versions have been frozen, a scan of all unfrozen pages is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: whole-table freezing is forced if
+ the table hasn't had all row versions guaranteed frozen within the last
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transactions.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
- that, data loss could result. To ensure that this does not happen,
- autovacuum is invoked on any table that might contain unfrozen rows with
- XIDs older than the age specified by the configuration parameter <xref
- linkend="guc-autovacuum-freeze-max-age">. (This will happen even if
- autovacuum is disabled.)
+ the time <command>VACUUM</> last scanned unfrozen pages. If it were to go
+ unvacuumed for longer than that, data loss could result. To ensure
+ that this does not happen, autovacuum is invoked on any table that might
+ contain unfrozen rows with XIDs older than the age specified by the
+ configuration parameter <xref linkend="guc-autovacuum-freeze-max-age">.
+ (This will happen even if autovacuum is disabled.)
</para>
<para>
@@ -490,8 +487,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +523,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +551,19 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. Whole-table freezing occurs only when every page happens to
+ require freezing. In other cases, such as when
+ <structfield>relfrozenxid</> is more than <varname>vacuum_freeze_table_age</>
+ transactions old or when <command>VACUUM</>'s <literal>FREEZE</> option is used,
+ <command>VACUUM</> can skip pages on which all tuples are already
+ marked as frozen.
+ When all pages of the table are eventually marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transactions started since the <command>VACUUM</> started).
+ If <structfield>relfrozenxid</> has not been advanced by the time
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,28 +640,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ When <command>VACUUM</> scans all unfrozen pages, regardless of what causes
+ it to do so, the value for that table can be advanced.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Such
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +741,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e64b7ef..1908a4d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1332,6 +1332,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Estimated number of rows modified since this table was last analyzed</entry>
</row>
<row>
+ <entry><structfield>n_frozen_page</></entry>
+ <entry><type>integer</></entry>
+ <entry>Number of frozen pages</entry>
+ </row>
+ <row>
<entry><structfield>last_vacuum</></entry>
<entry><type>timestamp with time zone</></entry>
<entry>Last time at which this table was manually vacuumed
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index eb113c2..275b69c 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -657,6 +657,12 @@ psql --username postgres --file script.sql postgres
</para>
<para>
+ Since the format of the visibility map has been changed in version 9.6,
+ <application>pg_upgrade</> creates and rewrites a new <literal>'_vm'</literal>
+ file even when upgrading from 9.5 or before to 9.6 or later with link mode (-k).
+ </para>
+
+ <para>
All failure, rebuild, and reindex cases will be reported by
<application>pg_upgrade</> if they affect your installation;
post-upgrade scripts to rebuild tables and indexes will be
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..5dc8b04 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -623,18 +623,20 @@ can be used to examine the information stored in free space maps.
<para>
Each heap relation has a Visibility Map
(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
+visible to all active transactions, and which pages contain only frozen tuples.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_vm</> suffix.
+For example, if the filenode of a relation is 12345, the VM is stored in a file
+called <filename>12345_vm</>, in the same directory as the main relation file.
Note that indexes do not have VMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The visibility map stores two bits per heap page: all-visible and all-frozen.
+A set all-visible bit means that all tuples on the page are known to be visible
+to all transactions. A set all-frozen bit means that all tuples on the page are
+completely marked as frozen. This means that the page does not contain any tuples
+that need to be vacuumed, even if a scan of the whole table is required.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9ff7a41..651dd0e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3034,9 +3034,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -7205,7 +7205,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -7215,6 +7215,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7804,7 +7805,12 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7856,7 +7862,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6db73bf..0fe49eb 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -119,7 +119,7 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock,
* be less than buffer2.
*/
static void
-GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
+GetVisibilitymapPins(Relation relation, Buffer buffer1, Buffer buffer2,
BlockNumber block1, BlockNumber block2,
Buffer *vmbuffer1, Buffer *vmbuffer2)
{
@@ -380,11 +380,11 @@ RelationGetBufferForTuple(Relation relation, Size len,
* done.
*/
if (otherBuffer == InvalidBuffer || buffer <= otherBuffer)
- GetVisibilityMapPins(relation, buffer, otherBuffer,
+ GetVisibilitymapPins(relation, buffer, otherBuffer,
targetBlock, otherBlock, vmbuffer,
vmbuffer_other);
else
- GetVisibilityMapPins(relation, otherBuffer, buffer,
+ GetVisibilitymapPins(relation, otherBuffer, buffer,
otherBlock, targetBlock, vmbuffer_other,
vmbuffer);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..387a0d6 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -15,39 +15,46 @@
* visibilitymap_pin - pin a map page for setting a bit
* visibilitymap_pin_ok - check whether correct map page is already pinned
* visibilitymap_set - set a bit in a previously pinned page
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
* visibilitymap_count - count number of bits set in visibility map
* visibilitymap_truncate - truncate the visibility map
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if a whole-table-scanning vacuum is required (e.g. anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. The all-frozen bit must be
+ * cleared together with the all-visible bit.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is all-visible or
+ * all-frozen; we just don't know that for certain. The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility map
+ * bit. If a crash occurs after the visibility map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
+ * The visibility map also has an all-frozen bit, which indicates that all
+ * tuples on the corresponding page have been completely frozen, so the map is
+ * now used by anti-wraparound vacuums as well, even though they freeze tuples.
*
* LOCKING
*
@@ -58,14 +65,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +108,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -113,26 +123,44 @@
/* Mapping from heap block number to the right bit in the visibility map */
#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
-#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
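+
+/*
+ * With BITS_PER_HEAPBLOCK = 2, heap block x occupies the pair of bits starting
+ * at bit HEAPBLK_TO_MAPBIT(x) of its map byte; the lower bit of the pair is
+ * the all-visible bit and the upper bit is the all-frozen bit (see
+ * VISIBILITYMAP_ALL_VISIBLE and VISIBILITYMAP_ALL_FROZEN).
+ */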
+
+/* tables for fast counting of set bits for visible and frozen */
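+/*
+ * Each entry gives, for one map byte holding four (all-visible, all-frozen)
+ * bit pairs, the number of pairs whose all-visible (respectively all-frozen)
+ * bit is set; e.g. for a byte of 0x05 there are two all-visible bits and no
+ * all-frozen bits set.
+ */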
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +169,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits for one page in the visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,11 +181,11 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = VISIBILITYMAP_VALID_BITS << mapBit;
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s block %d", RelationGetRelationName(rel), heapBlk);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -186,7 +214,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
* visibilitymap_set to actually set the bit.
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk.
*
@@ -212,7 +240,7 @@ visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
* visibilitymap_pin_ok - do we already have the correct page pinned?
*
* On entry, buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. The return value indicates whether the buffer covers the
* given heapBlk.
*/
@@ -225,7 +253,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +262,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and the flags indicating which bit(s) to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +274,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,11 +284,12 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
Assert(InRecovery || BufferIsValid(heapBuf));
+ Assert(flags & VISIBILITYMAP_VALID_BITS);
/* Check that we have the right heap page pinned, if present */
if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -272,11 +303,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
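+ /* Only dirty the page and emit WAL if the stored bits differ from the requested flags */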
+ if (flags != ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << mapBit);
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +316,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +326,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(((flags & VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
+ ((flags & VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +346,17 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible to all, or all frozen, according
+ * to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
- * earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits.  The returned
+ * value contains the VISIBILITYMAP_ALL_VISIBLE and VISIBILITYMAP_ALL_FROZEN bits.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -327,17 +365,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* we might see the old value. It is the caller's responsibility to deal with
* all concurrency issues!
*/
-bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+uint8
+visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- bool result;
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_get_status %s, block %d", RelationGetRelationName(rel), heapBlk);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,13 +397,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * Reading the two bits is atomic, since both lie within a single byte.
+ * There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
-
- return result;
+ return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
}
/*
@@ -374,13 +409,17 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * If the caller wants the number of all-frozen pages as well, it passes a
+ * non-NULL all_frozen pointer; otherwise all_frozen may be NULL.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, BlockNumber *all_frozen)
{
- BlockNumber result = 0;
BlockNumber mapBlock;
+ BlockNumber all_visible = 0;
+
+ if (all_frozen)
+ *all_frozen = 0;
for (mapBlock = 0;; mapBlock++)
{
@@ -406,13 +445,15 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ all_visible += number_of_ones_for_visible[map[i]];
+ if (all_frozen)
+ *all_frozen += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
}
- return result;
+ return all_visible;
}
/*
@@ -435,7 +476,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ elog(DEBUG1, "vm_truncate %s block %d", RelationGetRelationName(rel), nheapblocks);
#endif
RelationOpenSmgr(rel);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..1cea026 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1813,7 +1813,7 @@ FormIndexDatum(IndexInfo *indexInfo,
* isprimary: if true, set relhaspkey true; else no change
* reltuples: if >= 0, set reltuples to this value; else no change
*
- * If reltuples >= 0, relpages and relallvisible are also updated (using
+ * If reltuples >= 0, relpages, relallvisible are also updated (using
* RelationGetNumberOfBlocks() and visibilitymap_count()).
*
* NOTE: an important side-effect of this operation is that an SI invalidation
@@ -1921,7 +1921,7 @@ index_update_stats(Relation rel,
BlockNumber relallvisible;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ relallvisible = visibilitymap_count(rel, NULL);
else /* don't bother for indexes */
relallvisible = 0;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ccc030f..8c555eb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -444,6 +444,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(C.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ddb68ab..02a2c68 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -314,6 +314,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ BlockNumber relallvisible,
+ relallfrozen;
if (inh)
ereport(elevel,
@@ -564,6 +566,10 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
}
}
+ /* Count the all-visible and all-frozen pages in the visibility map */
+ if (!inh)
+ relallvisible = visibilitymap_count(onerel, &relallfrozen);
+
/*
* Update pages/tuples stats in pg_class ... but not if we're doing
* inherited stats.
@@ -572,7 +578,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ relallvisible,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -608,7 +614,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
* tracks per-table stats.
*/
if (!inh)
- pgstat_report_analyze(onerel, totalrows, totaldeadrows);
+ pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
if (!(options & VACOPT_VACUUM))
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7c4ef58..0a02a25 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -729,11 +729,11 @@ vac_estimate_reltuples(Relation relation, bool is_analyze,
* marked with xmin = our xid.
*
* In addition to fundamentally nontransactional statistics such as
- * relpages and relallvisible, we try to maintain certain lazily-updated
- * DDL flags such as relhasindex, by clearing them if no longer correct.
- * It's safe to do this in VACUUM, which can't run in parallel with
- * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
- * However, it's *not* safe to do it in an ANALYZE that's within an
+ * relpages and relallvisible, we try to maintain certain lazily-updated
+ * DDL flags such as relhasindex, by clearing them if no longer correct.
+ * It's safe to do this in VACUUM, which can't run in parallel with
+ * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
+ * However, it's *not* safe to do it in an ANALYZE that's within an
* outer transaction, because for example the current transaction might
* have dropped the last index; then we'd think relhasindex should be
* cleared, but if the transaction later rolls back this would be wrong.
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2429889..994efb7 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped due to the
+ all-frozen bit of the visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -221,7 +224,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* We request a full scan if either the table's frozen Xid is now older
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit.
+ * mxid full-table scan limit. Even during a full scan, we can skip pages
+ * whose all-frozen bit is set in the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +257,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -301,10 +306,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = visibilitymap_count(onerel, &new_rel_allfrozen);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -325,7 +333,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),
onerel->rd_rel->relisshared,
new_live_tuples,
- vacrelstats->new_dead_tuples);
+ vacrelstats->new_dead_tuples,
+ new_rel_allfrozen);
/* and log the action if appropriate */
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
@@ -360,10 +369,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +496,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of the
+ * visibility map means that we can't update relfrozenxid, so we only want
+ * to do it if we can skip a goodly number of pages. On the other hand, we
+ * count both how many pages we skipped according to the all-frozen bit of
+ * the visibility map and how many pages we froze, so we can still update
+ * relfrozenxid if the sum of the two covers every page of the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -498,24 +511,24 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we skip only pages whose all-frozen bit is set; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
- * Note: The value returned by visibilitymap_test could be slightly
+ * Note: The value returned by visibilitymap_get_status could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*/
for (next_not_all_visible_block = 0;
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -533,9 +546,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples frozen by this vacuum */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -547,8 +564,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -562,14 +578,29 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
skipping_all_visible_blocks = true;
else
skipping_all_visible_blocks = false;
+
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We check whether it is also all-frozen, in which case we can skip
+ * vacuuming it even when scan_all is true.
+ */
+ bool all_frozen = VM_ALL_FROZEN(onerel, blkno, &vmbuffer);
+
+ if (scan_all && all_frozen)
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ else if (!scan_all && skipping_all_visible_blocks)
+ continue;
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -716,7 +747,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
empty_pages++;
freespace = PageGetHeapFreeSpace(page);
- /* empty pages are always all-visible */
+ /* empty pages are always all-visible and all-frozen */
if (!PageIsAllVisible(page))
{
START_CRIT_SECTION();
@@ -739,8 +770,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
log_newpage_buffer(buf, true);
PageSetAllVisible(page);
+ PageSetAllFrozen(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
END_CRIT_SECTION();
}
@@ -764,6 +797,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +953,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +971,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples, mark the buffer dirty, and write a WAL
+ * record recording the changes. We must log the changes to be crash-safe
+ * against future truncation of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1006,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1031,46 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,9 +1081,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
{
- elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+ /* If the all-frozen bit is set then the all-visible bit must be set too */
+ if (all_frozen_according_to_vm)
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page is not marked all-visible (and all-frozen) but visibility map bit(s) is set in relation \"%s\" page %u",
relname, blkno);
visibilitymap_clear(onerel, blkno, vmbuffer);
}
@@ -1028,19 +1096,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/*
* It's possible for the value returned by GetOldestXmin() to move
* backwards, so it's not wrong for us to see tuples that appear to
- * not be visible to everyone yet, while PD_ALL_VISIBLE is already
- * set. The real safe xmin value never moves backwards, but
- * GetOldestXmin() is conservative and sometimes returns a value
+ * not be visible to everyone yet, while PD_ALL_VISIBLE (and PD_ALL_FROZEN)
+ * are already set. The real safe xmin value never moves backwards,
+ * but GetOldestXmin() is conservative and sometimes returns a value
* that's unnecessarily small, so if we see that contradiction it just
* means that the tuples that we think are not visible to everyone yet
- * actually are, and the PD_ALL_VISIBLE flag is correct.
+ * actually are, and the PD_ALL_VISIBLE (and PD_ALL_FROZEN) flags are
+ * correct.
*
* There should never be dead tuples on a page with PD_ALL_VISIBLE
* set, however.
*/
else if (PageIsAllVisible(page) && has_dead_tuples)
{
- elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+ /* If the all-frozen bit is set then the all-visible bit must be set too */
+ if (PageIsAllFrozen(page))
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page containing dead tuples is marked as all-visible (and all-frozen) in relation \"%s\" page %u",
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
@@ -1114,6 +1188,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to the visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %u frozen page according to visibility map",
+ "skipped %u frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages,
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1230,6 +1311,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1281,19 +1363,34 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit. Likewise, if the page is
+ * all-frozen, set the page-level PD_ALL_FROZEN flag and the VM all-frozen bit.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
+ uint8 vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer);
+ uint8 flags = 0;
+
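+ /* Collect in 'flags' only the bits that are not already set in the visibility map */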
+ if (!(vm_status & VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* Also request the VM all-frozen bit, if needed */
+ if (all_frozen && !(vm_status & VISIBILITYMAP_ALL_FROZEN))
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+
+ if (flags != 0)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1783,10 +1880,12 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and set *all_frozen to indicate whether
+ * every tuple on this page is frozen.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1795,6 +1894,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1818,11 +1918,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible or frozen.
*/
if (ItemIdIsDead(itemid))
{
all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1859,6 +1960,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1867,6 +1972,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
@@ -1875,5 +1981,8 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
} /* scan along page */
+ if (!all_visible)
+ *all_frozen = false;
+
return all_visible;
}
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..e345177 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -85,7 +85,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
* which all tuples are known visible to everybody. In any case,
* we'll use the index tuple not the heap tuple as the data source.
*
- * Note on Memory Ordering Effects: visibilitymap_test does not lock
+ * Note on Memory Ordering Effects: visibilitymap_get_status does not lock
* the visibility map buffer, and therefore the result we read here
* could be slightly stale. However, it can't be stale enough to
* matter.
@@ -114,9 +114,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
* It's worth going through this complexity to avoid needing to lock
* the VM buffer, which could cause significant contention.
*/
- if (!visibilitymap_test(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ if (!VM_ALL_VISIBLE(scandesc->heapRelation, ItemPointerGetBlockNumber(tid),
+ &node->ioss_VMBuffer))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 990486c..d100a7d 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -468,7 +468,7 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
* estimates based on the correlation squared (XXX is that appropriate?).
*
* If it's an index-only scan, then we will not need to fetch any heap
- * pages for which the visibility map shows all tuples are visible.
+ * pages for which the visibility map shows all tuples are visible.
* Hence, reduce the estimated number of heap fetches accordingly.
* We use the measured fraction of the entire heap that is all-visible,
* which might not be particularly relevant to the subset of the heap
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ab018c4..ca7257a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1329,7 +1329,8 @@ pgstat_report_autovac(Oid dboid)
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgVacuum msg;
@@ -1343,6 +1344,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -1354,7 +1356,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
*/
void
pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgAnalyze msg;
@@ -1394,6 +1397,7 @@ pgstat_report_analyze(Relation rel,
msg.m_analyzetime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -3702,6 +3706,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
+ result->n_frozen_pages = 0;
result->blocks_fetched = 0;
result->blocks_hit = 0;
result->vacuum_timestamp = 0;
@@ -5069,6 +5074,7 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
if (msg->m_autovacuum)
{
@@ -5103,6 +5109,7 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
/*
* We reset changes_since_analyze to zero, forgetting any changes that
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f7c9bf6..98c14f9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -46,6 +46,7 @@ extern Datum pg_stat_get_vacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autovacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_analyze_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_frozen_pages(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_calls(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_total_time(PG_FUNCTION_ARGS);
@@ -450,6 +451,21 @@ pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_frozen_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int32 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int32) (tabentry->n_frozen_pages);
+
+ PG_RETURN_INT32(result);
+}
+
+Datum
pg_stat_get_function_calls(PG_FUNCTION_ARGS)
{
Oid funcid = PG_GETARG_OID(0);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 37eb832..74d5cc7 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
#include <fcntl.h>
@@ -21,6 +22,43 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static const char *rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool force);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
+
+/*
+ * copyOrRewriteFile()
+ * This function copies a file, or rewrites a visibility map file.
+ * If rewrite_vm is true, we rewrite the visibility map regardless of the value of pageConverter.
+ */
+const char *
+copyOrRewriteFile(pageCnvCtx *pageConverter,
+ const char *src, const char *dst, bool force, bool rewrite_vm)
+{
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, force);
+ else
+ return copyAndUpdateFile(pageConverter, src, dst, force);
+}
/*
* copyAndUpdateFile()
@@ -115,12 +153,14 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
*/
const char *
linkAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst)
+ const char *src, const char *dst, bool rewrite_vm)
{
if (pageConverter != NULL)
return "Cannot in-place update this cluster, page-by-page conversion is required";
- if (pg_link_file(src, dst) == -1)
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, true);
+ else if (pg_link_file(src, dst) == -1)
return getErrorText(errno);
else
return NULL;
@@ -205,6 +245,100 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilitymap()
+ *
+ * An additional bit indicating that all tuples on a page are completely
+ * frozen has been added to the visibility map, so the format of the
+ * visibility map has changed.  This copies a visibility map file while
+ * inserting a cleared all-frozen bit (0) next to each existing bit.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+#define REWRITE_BUF_SIZE (50 * BLCKSZ)
+#define BITS_PER_HEAPBLOCK 2
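+/* Each old-format map byte (eight blocks x one bit) expands to two new-format bytes (eight blocks x two bits) */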
+
+ int src_fd = 0;
+ int dst_fd = 0;
+ uint16 vm_bits;
+ ssize_t nbytes;
+ char *buffer = NULL;
+ int ret = 0;
+
+ /* Reset errno */
+ errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return getErrorText(EINVAL);
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
+
+ /* Copy page header data in advance */
+ if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
+ goto err;
+
+ if (write(dst_fd, buffer, nbytes) != nbytes)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ /* perform the data rewriting, i.e. read from the source, write to the destination */
+ while (true)
+ {
+ ssize_t nbytes = read(src_fd, buffer, REWRITE_BUF_SIZE);
+ char *cur, *end;
+
+ if (nbytes < 0)
+ {
+ ret = -1;
+ break;
+ }
+
+ if (nbytes == 0)
+ break;
+
+ cur = buffer;
+ end = buffer + nbytes;
+
+ /* Rewrite each source byte into BITS_PER_HEAPBLOCK bytes and write them to dst_fd */
+ while (end > cur)
+ {
+ /* Look up the rewritten two-byte bit pattern for this source byte */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+
+ if (write(dst_fd, &vm_bits, BITS_PER_HEAPBLOCK) != BITS_PER_HEAPBLOCK)
+ {
+ ret = -1;
+ break;
+ }
+ cur++;
+ }
+ }
+
+err:
+
+ if (buffer)
+ pg_free(buffer);
+
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText(errno);
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index fa4661b..4943d9d 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -110,6 +110,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map was changed in this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201511191
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -394,10 +398,12 @@ const pageCnvCtx *setupPageConverter(void);
typedef void *pageCnvCtx;
#endif
+const char *copyOrRewriteFile(pageCnvCtx *pageConverter, const char *src,
+ const char *dst, bool force, bool rewrite_vm);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst);
+ const char *dst, bool rewrite_vm);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c22df42..b3322e9 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *suffix, bool vm_need_rewrite);
/*
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_need_rewrite = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,13 +181,20 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_need_rewrite = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(pageConverter, &maps[mapnum], "", false);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +202,14 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(pageConverter, &maps[mapnum], "_fsm", false);
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ {
+ if (vm_need_rewrite)
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", true);
+ else
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", false);
+ }
}
}
}
@@ -210,7 +223,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
*/
static void
transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+ const char *type_suffix, bool vm_need_rewrite)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -276,7 +289,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ if ((msg = copyOrRewriteFile(pageConverter, old_file, new_file, true, vm_need_rewrite)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -284,7 +297,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
+ if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file, vm_need_rewrite)) != NULL)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index aa7f399..6b058d4 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
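+ # Record relallvisible for each user relation so it can be compared before and after the upgrade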
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..c55d232 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index b3b91e7..a200e5e 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -40,6 +40,6 @@ extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other);
+ Buffer *pimbuffer, Buffer *pimbuffer_other);
#endif /* HIO_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..5f032ab 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,28 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
+#define VISIBILITYMAP_VALID_BITS 0x03 /* OR of all valid visibilitymap flag bits */
+
+/* Macros for visibilitymap test */
+#define VM_ALL_VISIBLE(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_VISIBLE) != 0)
+#define VM_ALL_FROZEN(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern BlockNumber visibilitymap_count(Relation rel, BlockNumber *all_frozen);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index eba4150..6df8298 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201511071
+#define CATALOG_VERSION_NO 201511191
#endif
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index d8640db..9a77d7d 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2779,6 +2779,8 @@ DATA(insert OID = 3056 ( pg_stat_get_analyze_count PGNSP PGUID 12 1 0 0 0 f f f
DESCR("statistics: number of manual analyzes for a table");
DATA(insert OID = 3057 ( pg_stat_get_autoanalyze_count PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_autoanalyze_count _null_ _null_ _null_ ));
DESCR("statistics: number of auto analyzes for a table");
+DATA(insert OID = 6015 ( pg_stat_get_frozen_pages PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_frozen_pages _null_ _null_ _null_ ));
+DESCR("statistics: number of frozen pages of table");
DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 0 f f f f t t s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_idset _null_ _null_ _null_ ));
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f f f t s r 1 0 2249 "23" "{23,26,23,26,25,25,25,16,1184,1184,1184,1184,869,25,23,28,28,16,25,25,23,16,25}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,pid,usesysid,application_name,state,query,waiting,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,ssl,sslversion,sslcipher,sslbits,sslcompression,sslclientdn}" _null_ _null_ pg_stat_get_activity _null_ _null_ _null_ ));
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index eb3591a..6165500 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1363,7 +1363,7 @@ typedef struct IndexScanState
* RuntimeContext expr context for evaling runtime Skeys
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
- * VMBuffer buffer in use for visibility map testing, if any
+ * PIMBuffer buffer in use for visibility map testing, if any
* HeapFetches number of tuples we were forced to fetch from heap
* ----------------
*/
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9ecc163..614ca5a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -355,6 +355,7 @@ typedef struct PgStat_MsgVacuum
TimestampTz m_vacuumtime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgVacuum;
@@ -372,6 +373,7 @@ typedef struct PgStat_MsgAnalyze
TimestampTz m_analyzetime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgAnalyze;
@@ -551,7 +553,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9E
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -615,6 +617,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter n_dead_tuples;
PgStat_Counter changes_since_analyze;
+ int32 n_frozen_pages;
+
PgStat_Counter blocks_fetched;
PgStat_Counter blocks_hit;
@@ -917,9 +921,11 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type t
extern void pgstat_report_autovac(Oid dboid);
extern void pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_recovery_conflict(int reason);
extern void pgstat_report_deadlock(void);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..102aa81 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ * frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -367,7 +369,12 @@ typedef PageHeaderData *PageHeader;
#define PageSetAllVisible(page) \
(((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
#define PageClearAllVisible(page) \
- (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+ (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
+
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
#define PageIsPrunable(page, oldestxmin) \
( \
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 80374e4..b259e65 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1683,6 +1683,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(c.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid) AS last_autovacuum,
pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
@@ -1795,6 +1796,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
@@ -1838,6 +1840,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index f5be70f..dd49786 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -127,6 +127,8 @@ SELECT count(*) FROM tenk2 WHERE unique1 = 1;
1
(1 row)
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
SELECT pg_sleep(1.0);
@@ -175,6 +177,14 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
t | t
(1 row)
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+ ?column?
+----------
+ t
+(1 row)
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
snapshot_newer
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..3be0354
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,22 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 45 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+\set VERBOSITY default
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 3987b4c..5253a29 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -110,3 +110,6 @@ test: event_trigger
# run stats by itself because its delay may be insufficient under heavy load
test: stats
+
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 379f272..c5fd695 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -160,3 +160,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index cd2d592..b3c640f 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -120,6 +120,8 @@ ROLLBACK;
SELECT count(*) FROM tenk2;
-- do an indexscan
SELECT count(*) FROM tenk2 WHERE unique1 = 1;
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
@@ -145,6 +147,10 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
WHERE st.relname='tenk2' AND cl.relname='tenk2';
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..365570b
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,16 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+\set VERBOSITY default
+
+DROP TABLE vmtest;
On Thu, Nov 19, 2015 at 6:44 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Nov 19, 2015 at 5:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Nov 18, 2015 at 11:18 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
I get an error when running pg_upgrade from 9.4 to 9.6 - this
error while copying relation "mediawiki.archive"
("/tmp/data/base/16414/21043_vm" to
"/tmp/data_fm/base/16400/21043_vm"): No such file or directory

OK, so the problem seems to be that rewriteVisibilitymap can get
called with errno already set to a nonzero value. It never clears it,
and then fails at the end despite the fact that no error has actually
occurred. Just setting it to 0 at the top of the function seems to be
the correct thing to do. Or does it need to save the old value and
restore it?
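(Clearing it at the top amounts to the pattern below; this is only a sketch
using a plain POSIX copy loop as a stand-in, not the actual
rewriteVisibilitymap, and copy_file_sketch is an invented name.)

#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Sketch of the errno-reset pattern being discussed: clear errno before
 * the copy loop so that a stale value left behind by an unrelated earlier
 * call cannot be mistaken for a real I/O error when errno is checked
 * after the loop.
 */
static const char *
copy_file_sketch(const char *src_path, const char *dst_path)
{
	char		buf[8192];
	ssize_t		nread;
	int			src = open(src_path, O_RDONLY);
	int			dst = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (src < 0 || dst < 0)
		return strerror(errno);

	errno = 0;					/* the proposed fix: start from a clean slate */
	while ((nread = read(src, buf, sizeof(buf))) > 0)
	{
		if (write(dst, buf, nread) != nread)
			break;				/* errno now describes the write failure */
	}

	close(src);
	close(dst);

	/* a short read at EOF leaves errno at 0, so only real errors are reported */
	return (errno != 0) ? strerror(errno) : NULL;
}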
Thank you for testing!
I think that the former is better, so attached latest patch.

But now when I want to do the upgrade faster, I run into this:

"This utility cannot upgrade from PostgreSQL version from 9.5 or
before to 9.6 or later with link mode."

Is this really an acceptable tradeoff? Surely we can arrange to
link everything else and rewrite just the _vm, which is a tiny portion
of the data directory. I don't think that -k promises to link
everything it possibly can.

I agree.
I've changed the patch so that pg_upgrade creates a new _vm file and
rewrites it even when upgrading to 9.6 with link mode.
The rewrite code thinks that only the first page of a vm has a header
of size SizeOfPageHeaderData, and the rest of the pages have a zero
size header. So the resulting _vm is corrupt.
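Put differently, every 8 kB page of the new _vm fork needs its own page
header, so the conversion loop presumably has to look roughly like the
sketch below (the function name and the bit expansion are illustrative,
not the submitted rewriteVisibilitymap):

#include "postgres.h"
#include "storage/bufpage.h"

/*
 * Illustrative sketch only: expand one old-format VM page (1 bit per heap
 * page) into two new-format pages (2 bits per heap page).  The point above
 * is that PageInit() has to run for every output page, so each page of the
 * new _vm carries a valid header, not just the first one.
 */
static void
rewrite_vm_page_sketch(char *oldpage, char *newpages)	/* newpages: 2 * BLCKSZ */
{
	const uint8 *oldmap = (const uint8 *) PageGetContents((Page) oldpage);
	Size		mapsize = BLCKSZ - MAXALIGN(SizeOfPageHeaderData);
	int			half;
	Size		i;
	int			bit;

	for (half = 0; half < 2; half++)
	{
		Page		dst = (Page) (newpages + half * BLCKSZ);
		uint8	   *newmap;

		PageInit(dst, BLCKSZ, 0);	/* fresh header on every output page */
		newmap = (uint8 *) PageGetContents(dst);

		/* each old byte (8 heap pages) expands into two new bytes */
		for (i = 0; i < mapsize / 2; i++)
		{
			uint8		old = oldmap[half * (mapsize / 2) + i];

			for (bit = 0; bit < 8; bit++)
				if (old & (1 << bit))
					newmap[2 * i + bit / 4] |= 1 << ((bit % 4) * 2);
		}
	}
}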
After pg_upgrade, doing a vacuum freeze verbose gives:
WARNING: invalid page in block 1 of relation base/16402/22430_vm;
zeroing out page
WARNING: invalid page in block 1 of relation base/16402/22430_vm;
zeroing out page
WARNING: invalid page in block 2 of relation base/16402/22430_vm;
zeroing out page
WARNING: invalid page in block 2 of relation base/16402/22430_vm;
zeroing out page
WARNING: invalid page in block 3 of relation base/16402/22430_vm;
zeroing out page
WARNING: invalid page in block 3 of relation base/16402/22430_vm;
zeroing out page
WARNING: invalid page in block 4 of relation base/16402/22430_vm;
zeroing out page
WARNING: invalid page in block 4 of relation base/16402/22430_vm;
zeroing out page
WARNING: invalid page in block 5 of relation base/16402/22430_vm;
zeroing out page
WARNING: invalid page in block 5 of relation base/16402/22430_vm;
zeroing out page
WARNING: invalid page in block 6 of relation base/16402/22430_vm;
zeroing out page
WARNING: invalid page in block 6 of relation base/16402/22430_vm;
zeroing out page
WARNING: invalid page in block 7 of relation base/16402/22430_vm;
zeroing out page
WARNING: invalid page in block 7 of relation base/16402/22430_vm;
zeroing out page
WARNING: invalid page in block 8 of relation base/16402/22430_vm;
zeroing out page
WARNING: invalid page in block 8 of relation base/16402/22430_vm;
zeroing out page
Cheers,
Jeff
On Sat, Nov 21, 2015 at 6:50 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
Thank you for taking the time to review this patch!
The updated version patch is attached.
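To make the interface change concrete, a caller of the new two-bit API would
look roughly like this; visibilitymap_get_status() and the VISIBILITYMAP_*
flags come from the attached patch, while the skip policy shown here is only
an illustration:

#include "postgres.h"
#include "access/visibilitymap.h"

/*
 * Illustration only: how a scan such as vacuum could consult the two bits.
 * An aggressive (freeze) vacuum may skip a block only if it is all-frozen;
 * an ordinary vacuum can already skip it if it is merely all-visible.
 */
static bool
can_skip_block_sketch(Relation rel, BlockNumber blkno, Buffer *vmbuffer,
					  bool aggressive)
{
	uint8		status = visibilitymap_get_status(rel, blkno, vmbuffer);

	if (aggressive)
		return (status & VISIBILITYMAP_ALL_FROZEN) != 0;

	return (status & VISIBILITYMAP_ALL_VISIBLE) != 0;
}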
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v26.patchapplication/octet-stream; name=000_add_frozen_bit_into_visibilitymap_v26.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..e8ebfe9 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6e14851..c75a166 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5905,7 +5905,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -5949,7 +5949,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..5a43c28 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and of which pages contain only frozen tuples.
+ This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -438,28 +438,25 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows. To ensure all old
+ row versions have been frozen, a scan of all unfrozen pages is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: whole-table freezing is forced if
+ it has not been ensured that all row versions in the table are frozen
+ for <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transactions.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
- that, data loss could result. To ensure that this does not happen,
- autovacuum is invoked on any table that might contain unfrozen rows with
- XIDs older than the age specified by the configuration parameter <xref
- linkend="guc-autovacuum-freeze-max-age">. (This will happen even if
- autovacuum is disabled.)
+ the time <command>VACUUM</> last scanned unfrozen pages. If it were to go
+ unvacuumed for longer than that, data loss could result. To ensure
+ that this does not happen, autovacuum is invoked on any table that might
+ contain unfrozen rows with XIDs older than the age specified by the
+ configuration parameter <xref linkend="guc-autovacuum-freeze-max-age">.
+ (This will happen even if autovacuum is disabled.)
</para>
<para>
@@ -490,8 +487,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +523,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +551,19 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. Whole-table freezing occurs only when all pages happen to
+ require freezing of rows. In other cases, such as where
+ <structfield>relfrozenxid</> is more than <varname>vacuum_freeze_table_age</>
+ transactions old, or where <command>VACUUM</>'s <literal>FREEZE</> option is used,
+ <command>VACUUM</> can skip the pages on which all tuples are already
+ marked as frozen.
+ When all pages of the table are eventually marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transactions started since the <command>VACUUM</> started).
+ If <structfield>relfrozenxid</> has not been advanced by the time
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,28 +640,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ When <command>VACUUM</> scans all unfrozen pages, regardless of what
+ causes it to do so, the value for that table can be advanced.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. A
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +741,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e64b7ef..1908a4d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1332,6 +1332,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Estimated number of rows modified since this table was last analyzed</entry>
</row>
<row>
+ <entry><structfield>n_frozen_page</></entry>
+ <entry><type>integer</></entry>
+ <entry>Number of frozen pages</entry>
+ </row>
+ <row>
<entry><structfield>last_vacuum</></entry>
<entry><type>timestamp with time zone</></entry>
<entry>Last time at which this table was manually vacuumed
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index eb113c2..275b69c 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -657,6 +657,12 @@ psql --username postgres --file script.sql postgres
</para>
<para>
+ Since the format of the visibility map was changed in version 9.6,
+ <application>pg_upgrade</> creates and rewrites a new <literal>'_vm'</literal>
+ file even when upgrading from 9.5 or before to 9.6 or later with link mode (-k).
+ </para>
+
+ <para>
All failure, rebuild, and reindex cases will be reported by
<application>pg_upgrade</> if they affect your installation;
post-upgrade scripts to rebuild tables and indexes will be
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..5dc8b04 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -623,18 +623,20 @@ can be used to examine the information stored in free space maps.
<para>
Each heap relation has a Visibility Map
(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
+visible to all active transactions, and which pages contain only frozen tuples.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_vm</> suffix.
+For example, if the filenode of a relation is 12345, the VM is stored in a file
+called <filename>12345_vm</>, in the same directory as the main relation file.
Note that indexes do not have VMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The visibility map stores two bits per heap page: all-visible and all-frozen.
+A set all-visible bit means that all tuples on the page are known to be visible
+to all transactions. A set all-frozen bit means that all tuples on the page are
+completely frozen. This means that the page does not contain any tuples
+that need to be vacuumed, even when a whole-table scan is required.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9ff7a41..651dd0e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3034,9 +3034,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -7205,7 +7205,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -7215,6 +7215,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7804,7 +7805,12 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7856,7 +7862,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6db73bf..0fe49eb 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -119,7 +119,7 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock,
* be less than buffer2.
*/
static void
-GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
+GetVisibilitymapPins(Relation relation, Buffer buffer1, Buffer buffer2,
BlockNumber block1, BlockNumber block2,
Buffer *vmbuffer1, Buffer *vmbuffer2)
{
@@ -380,11 +380,11 @@ RelationGetBufferForTuple(Relation relation, Size len,
* done.
*/
if (otherBuffer == InvalidBuffer || buffer <= otherBuffer)
- GetVisibilityMapPins(relation, buffer, otherBuffer,
+ GetVisibilitymapPins(relation, buffer, otherBuffer,
targetBlock, otherBlock, vmbuffer,
vmbuffer_other);
else
- GetVisibilityMapPins(relation, otherBuffer, buffer,
+ GetVisibilitymapPins(relation, otherBuffer, buffer,
otherBlock, targetBlock, vmbuffer_other,
vmbuffer);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..387a0d6 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -15,39 +15,46 @@
* visibilitymap_pin - pin a map page for setting a bit
* visibilitymap_pin_ok - check whether correct map page is already pinned
* visibilitymap_set - set a bit in a previously pinned page
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
* visibilitymap_count - count number of bits set in visibility map
* visibilitymap_truncate - truncate the visibility map
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if a whole-table-scanning vacuum is required (e.g. an anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. The all-frozen bit must be
+ * cleared together with the all-visible bit.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is all-visible or
+ * all-frozen; we just don't know that for certain. The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility map
+ * bit. If a crash occurs after the visibility map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
+ * The visibility map also has an all-frozen bit which indicates that all tuples
+ * on the corresponding page have been completely frozen, so the visibility map
+ * is used by anti-wraparound vacuums as well, even though they need to freeze tuples.
*
* LOCKING
*
@@ -58,14 +65,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +108,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block:
+ * one for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -113,26 +123,44 @@
/* Mapping from heap block number to the right bit in the visibility map */
#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
-#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
+
+/* tables for fast counting of set bits for visible and freeze */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +169,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits for one heap page in the visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,11 +181,11 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = VISIBILITYMAP_VALID_BITS << mapBit;
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s block %d", RelationGetRelationName(rel), heapBlk);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -186,7 +214,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
* visibilitymap_set to actually set the bit.
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk.
*
@@ -212,7 +240,7 @@ visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
* visibilitymap_pin_ok - do we already have the correct page pinned?
*
* On entry, buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. The return value indicates whether the buffer covers the
* given heapBlk.
*/
@@ -225,7 +253,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +262,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and the flags indicating which bits to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +274,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,11 +284,12 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
Assert(InRecovery || BufferIsValid(heapBuf));
+ Assert(flags & VISIBILITYMAP_VALID_BITS);
/* Check that we have the right heap page pinned, if present */
if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -272,11 +303,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != ((map[mapByte] >> mapBit) & flags))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << mapBit);
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +316,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +326,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(((flags & VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
+ ((flags & VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +346,17 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible to all or are marked frozen, according
+ * to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
- * earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits. The returned value
+ * can be tested against the VISIBILITYMAP_* flags the caller is interested in.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -327,17 +365,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* we might see the old value. It is the caller's responsibility to deal with
* all concurrency issues!
*/
-bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+uint8
+visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- bool result;
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_get_status %s, block %d", RelationGetRelationName(rel), heapBlk);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,13 +397,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * The read of the two bits is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
-
- return result;
+ return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
}
/*
@@ -374,13 +409,17 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * If all_frozen is not NULL, the number of all-frozen pages is also counted.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, BlockNumber *all_frozen)
{
- BlockNumber result = 0;
BlockNumber mapBlock;
+ BlockNumber all_visible = 0;
+
+ if (all_frozen)
+ *all_frozen = 0;
for (mapBlock = 0;; mapBlock++)
{
@@ -406,13 +445,15 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ all_visible += number_of_ones_for_visible[map[i]];
+ if (all_frozen)
+ *all_frozen += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
}
- return result;
+ return all_visible;
}
/*
@@ -435,7 +476,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ elog(DEBUG1, "vm_truncate %s block %d", RelationGetRelationName(rel), nheapblocks);
#endif
RelationOpenSmgr(rel);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..1cea026 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1813,7 +1813,7 @@ FormIndexDatum(IndexInfo *indexInfo,
* isprimary: if true, set relhaspkey true; else no change
* reltuples: if >= 0, set reltuples to this value; else no change
*
- * If reltuples >= 0, relpages and relallvisible are also updated (using
+ * If reltuples >= 0, relpages, relallvisible are also updated (using
* RelationGetNumberOfBlocks() and visibilitymap_count()).
*
* NOTE: an important side-effect of this operation is that an SI invalidation
@@ -1921,7 +1921,7 @@ index_update_stats(Relation rel,
BlockNumber relallvisible;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ relallvisible = visibilitymap_count(rel, NULL);
else /* don't bother for indexes */
relallvisible = 0;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ccc030f..8c555eb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -444,6 +444,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(C.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ddb68ab..02a2c68 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -314,6 +314,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ BlockNumber relallvisible,
+ relallfrozen;
if (inh)
ereport(elevel,
@@ -564,6 +566,10 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
}
}
+ /* Calculate the number of all-visible and all-frozen bits */
+ if (!inh)
+ relallvisible = visibilitymap_count(onerel, &relallfrozen);
+
/*
* Update pages/tuples stats in pg_class ... but not if we're doing
* inherited stats.
@@ -572,7 +578,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ relallvisible,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -608,7 +614,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
* tracks per-table stats.
*/
if (!inh)
- pgstat_report_analyze(onerel, totalrows, totaldeadrows);
+ pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
if (!(options & VACOPT_VACUUM))
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7c4ef58..0a02a25 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -729,11 +729,11 @@ vac_estimate_reltuples(Relation relation, bool is_analyze,
* marked with xmin = our xid.
*
* In addition to fundamentally nontransactional statistics such as
- * relpages and relallvisible, we try to maintain certain lazily-updated
- * DDL flags such as relhasindex, by clearing them if no longer correct.
- * It's safe to do this in VACUUM, which can't run in parallel with
- * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
- * However, it's *not* safe to do it in an ANALYZE that's within an
+ * relpages, relallvisible, we try to maintain certain lazily-updated
+ * DDL flags such as relhasindex, by clearing them if no longer correct.
+ * It's safe to do this in VACUUM, which can't run in
+ * parallel with CREATE INDEX/RULE/TRIGGER and can't be part of a transaction
+ * block. However, it's *not* safe to do it in an ANALYZE that's within an
* outer transaction, because for example the current transaction might
* have dropped the last index; then we'd think relhasindex should be
* cleared, but if the transaction later rolls back this would be wrong.
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2429889..994efb7 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped by all-frozen bit
+ of visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -221,7 +224,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* We request a full scan if either the table's frozen Xid is now older
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit.
+ * mxid full-table scan limit. Even during a full scan, we can skip pages
+ * according to the all-frozen bit of the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +257,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -301,10 +306,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = visibilitymap_count(onerel, &new_rel_allfrozen);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -325,7 +333,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),
onerel->rd_rel->relisshared,
new_live_tuples,
- vacrelstats->new_dead_tuples);
+ vacrelstats->new_dead_tuples,
+ new_rel_allfrozen);
/* and log the action if appropriate */
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
@@ -360,10 +369,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +496,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * the visibility map means that we can't update relfrozenxid, so we only
+ * want to do it if we can skip a goodly number. On the other hand, we
+ * count both how many pages we skipped according to the all-frozen bit of
+ * the visibility map and how many pages we froze, so we can still update
+ * relfrozenxid if the sum of the two covers every page of the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -498,24 +511,24 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we might not actually skip any pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
- * Note: The value returned by visibilitymap_test could be slightly
+ * Note: The value returned by visibilitymap_get_status could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*/
for (next_not_all_visible_block = 0;
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -533,9 +546,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples frozen in this vacuum */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -547,8 +564,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -562,14 +578,29 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
skipping_all_visible_blocks = true;
else
skipping_all_visible_blocks = false;
+
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We check whether this block is all-frozen, so that we can skip
+ * vacuuming it even if scan_all is true.
+ */
+ bool all_frozen = VM_ALL_FROZEN(onerel, blkno, &vmbuffer);
+
+ if (scan_all && all_frozen)
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ else if (!scan_all && skipping_all_visible_blocks)
+ continue;
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -716,7 +747,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
empty_pages++;
freespace = PageGetHeapFreeSpace(page);
- /* empty pages are always all-visible */
+ /* empty pages are always all-visible and all-frozen */
if (!PageIsAllVisible(page))
{
START_CRIT_SECTION();
@@ -739,8 +770,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
log_newpage_buffer(buf, true);
PageSetAllVisible(page);
+ PageSetAllFrozen(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
END_CRIT_SECTION();
}
@@ -764,6 +797,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +953,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +971,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we freeze any tuples, mark the buffer dirty, and write a WAL
+ * record recording the changes. We must log the changes to be crash-safe
+ * against future truncation of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1006,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1031,46 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,9 +1081,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
{
- elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+ /* If the all-frozen bit is set then the all-visible bit must be set */
+ if (all_frozen_according_to_vm)
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page is not marked all-visible (and all-frozen) but visibility map bit(s) is set in relation \"%s\" page %u",
relname, blkno);
visibilitymap_clear(onerel, blkno, vmbuffer);
}
@@ -1028,19 +1096,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/*
* It's possible for the value returned by GetOldestXmin() to move
* backwards, so it's not wrong for us to see tuples that appear to
- * not be visible to everyone yet, while PD_ALL_VISIBLE is already
- * set. The real safe xmin value never moves backwards, but
- * GetOldestXmin() is conservative and sometimes returns a value
+ * not be visible to everyone yet, while PD_ALL_VISIBLE (and PD_ALL_FROZEN)
+ * are already set. The real safe xmin value never moves backwards,
+ * but GetOldestXmin() is conservative and sometimes returns a value
* that's unnecessarily small, so if we see that contradiction it just
* means that the tuples that we think are not visible to everyone yet
- * actually are, and the PD_ALL_VISIBLE flag is correct.
+ * actually are, and the PD_ALL_VISIBLE (and PD_ALL_FROZEN) flags are
+ * correct.
*
* There should never be dead tuples on a page with PD_ALL_VISIBLE
* set, however.
*/
else if (PageIsAllVisible(page) && has_dead_tuples)
{
- elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+ /* If the all-frozen bit is set then the all-visible bit must be set */
+ if (PageIsAllFrozen(page))
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page containing dead tuples is marked as all-visible (and all-frozen) in relation \"%s\" page %u",
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
@@ -1114,6 +1188,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to visibility map",
+ "skipped %d frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages,
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1230,6 +1311,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1281,19 +1363,34 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set the VM all-frozen bit as well.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
+ uint8 vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer);
+ uint8 flags = 0;
+
+ if (!(vm_status & VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* Add the all-frozen bit to the flags, if needed */
+ if (all_frozen && !(vm_status & VISIBILITYMAP_ALL_FROZEN))
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+
+ if (vm_status != flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1783,10 +1880,12 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and set all_frozen to indicate whether
+ * every tuple on this page is frozen.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1795,6 +1894,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1818,11 +1918,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1859,6 +1960,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1867,6 +1972,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
@@ -1875,5 +1981,8 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
} /* scan along page */
+ if (!all_visible)
+ *all_frozen = false;
+
return all_visible;
}
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..e345177 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -85,7 +85,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
* which all tuples are known visible to everybody. In any case,
* we'll use the index tuple not the heap tuple as the data source.
*
- * Note on Memory Ordering Effects: visibilitymap_test does not lock
+ * Note on Memory Ordering Effects: visibilitymap_get_status does not lock
* the visibility map buffer, and therefore the result we read here
* could be slightly stale. However, it can't be stale enough to
* matter.
@@ -114,9 +114,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
* It's worth going through this complexity to avoid needing to lock
* the VM buffer, which could cause significant contention.
*/
- if (!visibilitymap_test(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ if (!VM_ALL_VISIBLE(scandesc->heapRelation, ItemPointerGetBlockNumber(tid),
+ &node->ioss_VMBuffer))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 990486c..d100a7d 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -468,7 +468,7 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
* estimates based on the correlation squared (XXX is that appropriate?).
*
* If it's an index-only scan, then we will not need to fetch any heap
- * pages for which the visibility map shows all tuples are visible.
+ * pages for which the visibility map shows all tuples are visible.
* Hence, reduce the estimated number of heap fetches accordingly.
* We use the measured fraction of the entire heap that is all-visible,
* which might not be particularly relevant to the subset of the heap
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ab018c4..ca7257a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1329,7 +1329,8 @@ pgstat_report_autovac(Oid dboid)
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgVacuum msg;
@@ -1343,6 +1344,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -1354,7 +1356,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
*/
void
pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgAnalyze msg;
@@ -1394,6 +1397,7 @@ pgstat_report_analyze(Relation rel,
msg.m_analyzetime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -3702,6 +3706,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
+ result->n_frozen_pages = 0;
result->blocks_fetched = 0;
result->blocks_hit = 0;
result->vacuum_timestamp = 0;
@@ -5069,6 +5074,7 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
if (msg->m_autovacuum)
{
@@ -5103,6 +5109,7 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
/*
* We reset changes_since_analyze to zero, forgetting any changes that
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f7c9bf6..98c14f9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -46,6 +46,7 @@ extern Datum pg_stat_get_vacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autovacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_analyze_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_frozen_pages(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_calls(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_total_time(PG_FUNCTION_ARGS);
@@ -450,6 +451,21 @@ pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_frozen_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int32 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int32) (tabentry->n_frozen_pages);
+
+ PG_RETURN_INT32(result);
+}
+
+Datum
pg_stat_get_function_calls(PG_FUNCTION_ARGS)
{
Oid funcid = PG_GETARG_OID(0);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 37eb832..38b404d 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,7 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
#include <fcntl.h>
@@ -21,6 +22,43 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static const char *rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool force);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
+
+/*
+ * copyOrRewriteFile()
+ * This function copies a relation file, or rewrites a visibility map file.
+ * If rewrite_vm is true, we rewrite the visibility map regardless of the value of pageConverter.
+ */
+const char *
+copyOrRewriteFile(pageCnvCtx *pageConverter,
+ const char *src, const char *dst, bool force, bool rewrite_vm)
+{
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, force);
+ else
+ return copyAndUpdateFile(pageConverter, src, dst, force);
+}
/*
* copyAndUpdateFile()
@@ -115,12 +153,14 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
*/
const char *
linkAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst)
+ const char *src, const char *dst, bool rewrite_vm)
{
if (pageConverter != NULL)
return "Cannot in-place update this cluster, page-by-page conversion is required";
- if (pg_link_file(src, dst) == -1)
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, true);
+ else if (pg_link_file(src, dst) == -1)
return getErrorText(errno);
else
return NULL;
@@ -205,6 +245,95 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilitymap()
+ *
+ * Copies a visibility map file, adding a cleared all-frozen bit next to each all-visible bit.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd = 0;
+ int dst_fd = 0;
+ char buffer[BLCKSZ];
+ ssize_t bytesRead;
+ int rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+
+ /* Reset errno */
+ errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return getErrorText(EINVAL);
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ /* Perform data rewriting per page */
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
+ char *cur, *end, *blkend;
+ char pageheader[SizeOfPageHeaderData];
+ uint16 vm_bits;
+
+ /* Save the page header data */
+ memcpy(pageheader, buffer, SizeOfPageHeaderData);
+
+ cur = buffer;
+ end = buffer + SizeOfPageHeaderData + rewriteVmBytesPerPage;
+ blkend = buffer + bytesRead;
+
+ while (blkend > end)
+ {
+ /* Copy page header data in advance */
+ if (write(dst_fd, pageheader, SizeOfPageHeaderData) != SizeOfPageHeaderData)
+ {
+ /* If write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ cur += SizeOfPageHeaderData;
+
+ /* Rewrite visibility map bit one by one */
+ while (end > cur)
+ {
+ /* Get rewritten bit from table and its string representation */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+
+ if (write(dst_fd, &vm_bits, BITS_PER_HEAPBLOCK) != BITS_PER_HEAPBLOCK)
+ {
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+ cur++;
+ }
+
+ end += rewriteVmBytesPerPage;
+ }
+ }
+
+err:
+
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText(errno);
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index fa4661b..d4e60ba 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -110,6 +110,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map was changed by this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201511221
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -394,10 +398,12 @@ const pageCnvCtx *setupPageConverter(void);
typedef void *pageCnvCtx;
#endif
+const char *copyOrRewriteFile(pageCnvCtx *pageConverter, const char *src,
+ const char *dst, bool force, bool rewrite_vm);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst);
+ const char *dst, bool rewrite_vm);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c22df42..b3322e9 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *suffix, bool vm_need_rewrite);
/*
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_need_rewrite = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,13 +181,20 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_need_rewrite = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(pageConverter, &maps[mapnum], "", false);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +202,14 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(pageConverter, &maps[mapnum], "_fsm", false);
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ {
+ if (vm_need_rewrite)
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", true);
+ else
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", false);
+ }
}
}
}
@@ -210,7 +223,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
*/
static void
transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+ const char *type_suffix, bool vm_need_rewrite)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -276,7 +289,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ if ((msg = copyOrRewriteFile(pageConverter, old_file, new_file, true, vm_need_rewrite)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -284,7 +297,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
+ if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file, vm_need_rewrite)) != NULL)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index aa7f399..6b058d4 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..c55d232 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer pim_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index b3b91e7..a200e5e 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -40,6 +40,6 @@ extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other);
+ Buffer *pimbuffer, Buffer *pimbuffer_other);
#endif /* HIO_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..5f032ab 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,28 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
+#define VISIBILITYMAP_VALID_BITS 0x03 /* OR of all valid visibility map flag bits */
+
+/* Macros for visibilitymap test */
+#define VM_ALL_VISIBLE(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_VISIBLE) != 0)
+#define VM_ALL_FROZEN(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern BlockNumber visibilitymap_count(Relation rel, BlockNumber *all_frozen);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index eba4150..df70e01 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201511071
+#define CATALOG_VERSION_NO 201511221
#endif
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index d8640db..9a77d7d 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2779,6 +2779,8 @@ DATA(insert OID = 3056 ( pg_stat_get_analyze_count PGNSP PGUID 12 1 0 0 0 f f f
DESCR("statistics: number of manual analyzes for a table");
DATA(insert OID = 3057 ( pg_stat_get_autoanalyze_count PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_autoanalyze_count _null_ _null_ _null_ ));
DESCR("statistics: number of auto analyzes for a table");
+DATA(insert OID = 6015 ( pg_stat_get_frozen_pages PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_frozen_pages _null_ _null_ _null_ ));
+DESCR("statistics: number of frozen pages of table");
DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 0 f f f f t t s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_idset _null_ _null_ _null_ ));
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f f f t s r 1 0 2249 "23" "{23,26,23,26,25,25,25,16,1184,1184,1184,1184,869,25,23,28,28,16,25,25,23,16,25}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,pid,usesysid,application_name,state,query,waiting,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,ssl,sslversion,sslcipher,sslbits,sslcompression,sslclientdn}" _null_ _null_ pg_stat_get_activity _null_ _null_ _null_ ));
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index eb3591a..6165500 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1363,7 +1363,7 @@ typedef struct IndexScanState
* RuntimeContext expr context for evaling runtime Skeys
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
- * VMBuffer buffer in use for visibility map testing, if any
+ * PIMBuffer buffer in use for visibility map testing, if any
* HeapFetches number of tuples we were forced to fetch from heap
* ----------------
*/
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9ecc163..614ca5a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -355,6 +355,7 @@ typedef struct PgStat_MsgVacuum
TimestampTz m_vacuumtime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgVacuum;
@@ -372,6 +373,7 @@ typedef struct PgStat_MsgAnalyze
TimestampTz m_analyzetime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgAnalyze;
@@ -551,7 +553,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9E
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -615,6 +617,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter n_dead_tuples;
PgStat_Counter changes_since_analyze;
+ int32 n_frozen_pages;
+
PgStat_Counter blocks_fetched;
PgStat_Counter blocks_hit;
@@ -917,9 +921,11 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type t
extern void pgstat_report_autovac(Oid dboid);
extern void pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_recovery_conflict(int reason);
extern void pgstat_report_deadlock(void);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..102aa81 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -367,7 +369,12 @@ typedef PageHeaderData *PageHeader;
#define PageSetAllVisible(page) \
(((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
#define PageClearAllVisible(page) \
- (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+ (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
+
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
#define PageIsPrunable(page, oldestxmin) \
( \
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 80374e4..b259e65 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1683,6 +1683,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(c.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid) AS last_autovacuum,
pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
@@ -1795,6 +1796,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
@@ -1838,6 +1840,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index f5be70f..dd49786 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -127,6 +127,8 @@ SELECT count(*) FROM tenk2 WHERE unique1 = 1;
1
(1 row)
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
SELECT pg_sleep(1.0);
@@ -175,6 +177,14 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
t | t
(1 row)
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+ ?column?
+----------
+ t
+(1 row)
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
snapshot_newer
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..3be0354
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,22 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+-- All pages should become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 45 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+\set VERBOSITY default
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 3987b4c..5253a29 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -110,3 +110,6 @@ test: event_trigger
# run stats by itself because its delay may be insufficient under heavy load
test: stats
+
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 379f272..c5fd695 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -160,3 +160,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index cd2d592..b3c640f 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -120,6 +120,8 @@ ROLLBACK;
SELECT count(*) FROM tenk2;
-- do an indexscan
SELECT count(*) FROM tenk2 WHERE unique1 = 1;
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
@@ -145,6 +147,10 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
WHERE st.relname='tenk2' AND cl.relname='tenk2';
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..365570b
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,16 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+-- All pages should become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+\set VERBOSITY default
+
+DROP TABLE vmtest;
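As an aside, the rewrite_vm_table in the pg_upgrade part of the patch above can be derived mechanically: bit j of an old visibility map byte (the all-visible bit for the j-th heap block covered by that byte) moves to bit 2*j in the new format, and the adjacent all-frozen bit at 2*j+1 starts out cleared. A small standalone C sketch like the following (not part of the patch; the function name and main() harness are illustrative) regenerates the table and spot-checks a few entries:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Expand one old-format VM byte (8 all-visible bits) into the new format
 * (8 pairs of all-visible/all-frozen bits, with all-frozen cleared).
 */
static uint16_t
expand_vm_byte(uint8_t old)
{
	uint16_t	new_bits = 0;
	int			j;

	for (j = 0; j < 8; j++)
	{
		if (old & (1 << j))
			new_bits |= (uint16_t) 1 << (2 * j);	/* all-visible bit */
		/* the all-frozen bit (2 * j + 1) is deliberately left as zero */
	}
	return new_bits;
}

int
main(void)
{
	/* spot-check against entries of rewrite_vm_table shown above */
	assert(expand_vm_byte(0x00) == 0);
	assert(expand_vm_byte(0x01) == 1);
	assert(expand_vm_byte(0x02) == 4);
	assert(expand_vm_byte(0x80) == 16384);
	assert(expand_vm_byte(0xFF) == 21845);	/* 0x5555 */

	printf("rewrite_vm_table entries check out\n");
	return 0;
}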
On Sun, Nov 22, 2015 at 8:16 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thank you for taking the time to review this patch!
The updated version patch is attached.
I am skeptical about just copying the old page header to be two new
page headers. I don't know what the implications for this will be on
pd_lsn. Since pg_upgrade can only run on a cluster that was cleanly
shutdown, I think that just copying it from the old page to both new
pages it turns into might be fine. But pd_checksum will certainly be
wrong, breaking pg_upgrade for cases where checksums are turned on in.
It needs to be recomputed on both new pages. It looks like there is
no precedence for doing that in pg_upgrade so this will be breaking
new ground.
Cheers,
Jeff
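For reference, the reason one old visibility map page turns into exactly two new pages is that the number of bits per heap block doubles. A rough, self-contained sketch of the arithmetic, using the default 8 kB block size and a 24-byte page header as stand-ins for the real BLCKSZ and SizeOfPageHeaderData:

#include <stdio.h>

#define BLCKSZ                 8192	/* assumed default block size */
#define SIZE_OF_PAGE_HEADER    24	/* stand-in for SizeOfPageHeaderData */
#define OLD_BITS_PER_HEAPBLOCK 1	/* all-visible only */
#define NEW_BITS_PER_HEAPBLOCK 2	/* all-visible + all-frozen */

int
main(void)
{
	int		payload_bits = (BLCKSZ - SIZE_OF_PAGE_HEADER) * 8;
	int		old_blocks_per_page = payload_bits / OLD_BITS_PER_HEAPBLOCK;
	int		new_blocks_per_page = payload_bits / NEW_BITS_PER_HEAPBLOCK;

	/* prints 65344 and 32672: each old page maps onto exactly two new pages */
	printf("old VM page covers %d heap blocks, new VM page covers %d\n",
		   old_blocks_per_page, new_blocks_per_page);
	return 0;
}

Because the split is exact, carrying the old header over to both new pages keeps the layout fields sensible, but as noted above pd_checksum (and possibly pd_lsn) would have to be recomputed per new page.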
On Mon, Nov 23, 2015 at 6:27 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Sun, Nov 22, 2015 at 8:16 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thank you for taking the time to review this patch!
The updated version patch is attached.
I am skeptical about just copying the old page header to be two new
page headers. I don't know what the implications for this will be on
pd_lsn. Since pg_upgrade can only run on a cluster that was cleanly
shutdown, I think that just copying it from the old page to both new
pages it turns into might be fine. But pd_checksum will certainly be
wrong, breaking pg_upgrade for cases where checksums are turned on.
It needs to be recomputed on both new pages. It looks like there is
no precedent for doing that in pg_upgrade, so this will be breaking
new ground.
Yeah, we need to consider computing the checksum if checksums are enabled.
I've changed the patch, and attached.
Please review it.
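For what it's worth, the recomputation could look roughly like this; it is only a sketch, assuming pg_checksum_page() is usable from frontend code via storage/checksum_impl.h and that pg_upgrade's ControlData exposes data_checksum_version (the function and variable names below are illustrative, not taken from the attached patch):

#include "postgres_fe.h"
#include "pg_upgrade.h"

#include "storage/bufpage.h"
#include "storage/checksum.h"
#include "storage/checksum_impl.h"

/*
 * Sketch: recompute pd_checksum for one rewritten VM page before it is
 * written out, but only when data checksums are enabled in the new cluster.
 */
static void
set_new_vm_page_checksum(char *new_vmbuf, BlockNumber new_blkno)
{
	if (new_cluster.controldata.data_checksum_version != 0)
		((PageHeader) new_vmbuf)->pd_checksum =
			pg_checksum_page(new_vmbuf, new_blkno);
}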
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v27.patchapplication/octet-stream; name=000_add_frozen_bit_into_visibilitymap_v27.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..e8ebfe9 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 392eb70..c43443a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5916,7 +5916,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs an eager freezing if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -5960,7 +5960,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs an eager freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..5a43c28 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and of which pages contain only frozen tuples.
+ This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -438,28 +438,25 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows. To ensure all old
+ row versions have been frozen, a scan of all unfrozen pages is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: a whole-table freezing is forced if
+ the table has not had all of its row versions verified as frozen within
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transactions.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
- that, data loss could result. To ensure that this does not happen,
- autovacuum is invoked on any table that might contain unfrozen rows with
- XIDs older than the age specified by the configuration parameter <xref
- linkend="guc-autovacuum-freeze-max-age">. (This will happen even if
- autovacuum is disabled.)
+ the time <command>VACUUM</> last scanned unfrozen pages. If it were to go
+ unvacuumed for longer than that, data loss could result. To ensure
+ that this does not happen, autovacuum is invoked on any table that might
+ contain unfrozen rows with XIDs older than the age specified by the
+ configuration parameter <xref linkend="guc-autovacuum-freeze-max-age">.
+ (This will happen even if autovacuum is disabled.)
</para>
<para>
@@ -490,8 +487,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +523,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +551,19 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. Whole-table freezing occurs only when every page happens to
+ require freezing. In other cases, such as when
+ <structfield>relfrozenxid</> is more than <varname>vacuum_freeze_table_age</>
+ transactions old or when <command>VACUUM</>'s <literal>FREEZE</> option is used,
+ <command>VACUUM</> can skip the pages whose tuples are all already
+ marked as frozen.
+ When all pages of the table are eventually marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transactions started since the <command>VACUUM</> started).
+ If <structfield>relfrozenxid</> has not been advanced by the time
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,28 +640,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ When <command>VACUUM</> scans all unfrozen pages, regardless of what causes
+ it to do so, the value for that table can be advanced.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Such
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +741,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e64b7ef..1908a4d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1332,6 +1332,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Estimated number of rows modified since this table was last analyzed</entry>
</row>
<row>
+ <entry><structfield>n_frozen_page</></entry>
+ <entry><type>integer</></entry>
+ <entry>Number of frozen pages</entry>
+ </row>
+ <row>
<entry><structfield>last_vacuum</></entry>
<entry><type>timestamp with time zone</></entry>
<entry>Last time at which this table was manually vacuumed
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index eb113c2..275b69c 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -657,6 +657,12 @@ psql --username postgres --file script.sql postgres
</para>
<para>
+ Since the format of the visibility map has been changed in version 9.6,
+ <application>pg_upgrade</> creates and rewrites the new <literal>'_vm'</literal>
+ file when upgrading from 9.5 or before to 9.6 or later, even in link mode (-k).
+ </para>
+
+ <para>
All failure, rebuild, and reindex cases will be reported by
<application>pg_upgrade</> if they affect your installation;
post-upgrade scripts to rebuild tables and indexes will be
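
The per-byte rewrite spreads each old all-visible bit into the even bit position of the new two-bit-per-page format and leaves the new all-frozen bit clear, since the old map cannot tell whether the tuples are actually frozen. A minimal standalone sketch of how such a lookup table can be generated (illustrative only; it is not part of the patch):

#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative sketch: expand one byte of the old visibility map
 * (8 heap pages, 1 bit each) into the new format (2 bits per page,
 * so one old byte becomes 16 bits).  The old all-visible bit lands
 * in the even (all-visible) position; the new all-frozen bit starts
 * out clear because the old format carries no freeze information.
 */
static uint16_t
spread_vm_byte(uint8_t old)
{
	uint16_t	newbits = 0;
	int			i;

	for (i = 0; i < 8; i++)
		if (old & (1 << i))
			newbits |= 1 << (2 * i);	/* all-visible bit of page i */

	return newbits;
}

int
main(void)
{
	int			i;

	/* prints the same values as the rewrite_vm_table[] further down */
	for (i = 0; i < 256; i++)
		printf("%u, ", (unsigned) spread_vm_byte((uint8_t) i));
	putchar('\n');
	return 0;
}
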
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..5dc8b04 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -623,18 +623,20 @@ can be used to examine the information stored in free space maps.
<para>
Each heap relation has a Visibility Map
(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
+visible to all active transactions, and which pages contain only frozen tuples.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_vm</> suffix.
+For example, if the filenode of a relation is 12345, the VM is stored in a file
+called <filename>12345_vm</>, in the same directory as the main relation file.
Note that indexes do not have VMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The visibility map stores two bits per heap page: all-visible and all-frozen.
+A set all-visible bit means that all tuples on the page are known to be visible
+to all transactions. A set all-frozen bit means that all tuples on the page are
+completely frozen. Such a page does not contain any tuples
+that need to be vacuumed, even when a whole-table scan is otherwise required.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9ff7a41..651dd0e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3034,9 +3034,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -7205,7 +7205,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -7215,6 +7215,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7804,7 +7805,12 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7856,7 +7862,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
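
For context, the WAL record emitted by log_heap_visible() now carries the visibility-map flags alongside the cutoff XID. The sketch below shows the extended record as assumed from the calls above; the real definition lives in heapam_xlog.h, relies on the backend's TransactionId and uint8 typedefs, and its exact layout here is a guess:

/*
 * Sketch only: xl_heap_visible extended with a flags field, inferred from
 * the log_heap_visible()/heap_xlog_visible() changes above.  Field order
 * and SizeOfHeapVisible are assumptions, not the patch's header hunk.
 */
typedef struct xl_heap_visible
{
	TransactionId cutoff_xid;	/* newest xmin on the marked page */
	uint8		flags;			/* VISIBILITYMAP_ALL_VISIBLE and/or _ALL_FROZEN */
} xl_heap_visible;

#define SizeOfHeapVisible	(offsetof(xl_heap_visible, flags) + sizeof(uint8))
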
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6db73bf..0fe49eb 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -119,7 +119,7 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock,
* be less than buffer2.
*/
static void
-GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
+GetVisibilitymapPins(Relation relation, Buffer buffer1, Buffer buffer2,
BlockNumber block1, BlockNumber block2,
Buffer *vmbuffer1, Buffer *vmbuffer2)
{
@@ -380,11 +380,11 @@ RelationGetBufferForTuple(Relation relation, Size len,
* done.
*/
if (otherBuffer == InvalidBuffer || buffer <= otherBuffer)
- GetVisibilityMapPins(relation, buffer, otherBuffer,
+ GetVisibilitymapPins(relation, buffer, otherBuffer,
targetBlock, otherBlock, vmbuffer,
vmbuffer_other);
else
- GetVisibilityMapPins(relation, otherBuffer, buffer,
+ GetVisibilitymapPins(relation, otherBuffer, buffer,
otherBlock, targetBlock, vmbuffer_other,
vmbuffer);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..387a0d6 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -15,39 +15,46 @@
* visibilitymap_pin - pin a map page for setting a bit
* visibilitymap_pin_ok - check whether correct map page is already pinned
* visibilitymap_set - set a bit in a previously pinned page
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
* visibilitymap_count - count number of bits set in visibility map
* visibilitymap_truncate - truncate the visibility map
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if a whole-table-scanning vacuum is required (e.g. anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. And all-frozen bit must be
+ * cleared with all-visible at the same time.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is all-visible or
+ * all-frozen; we just don't know that for certain. The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility map
+ * bit. If a crash occurs after the visibility map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
+ * The visibility map also has an all-frozen bit, which indicates that all tuples
+ * on the corresponding page have been completely frozen, so the visibility map
+ * is used even by anti-wraparound vacuums, which must freeze tuples.
*
* LOCKING
*
@@ -58,14 +65,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +108,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -113,26 +123,44 @@
/* Mapping from heap block number to the right bit in the visibility map */
#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
-#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
+
+/* tables for fast counting of set bits for visible and freeze */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +169,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,11 +181,11 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = VISIBILITYMAP_VALID_BITS << mapBit;
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s block %d", RelationGetRelationName(rel), heapBlk);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -186,7 +214,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
* visibilitymap_set to actually set the bit.
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk.
*
@@ -212,7 +240,7 @@ visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
* visibilitymap_pin_ok - do we already have the correct page pinned?
*
* On entry, buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. The return value indicates whether the buffer covers the
* given heapBlk.
*/
@@ -225,7 +253,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +262,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and flags which indicates what flag we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +274,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,11 +284,12 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
Assert(InRecovery || BufferIsValid(heapBuf));
+ Assert(flags & VISIBILITYMAP_VALID_BITS);
/* Check that we have the right heap page pinned, if present */
if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -272,11 +303,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if ((map[mapByte] & (flags << mapBit)) != (flags << mapBit))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << mapBit);
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +316,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +326,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(((flags & VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
+ ((flags & VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +346,17 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible to all transactions, or all frozen, according
+ * to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
- * earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits. The returned
+ * status covers both the all-visible and the all-frozen bit for the page.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -327,17 +365,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* we might see the old value. It is the caller's responsibility to deal with
* all concurrency issues!
*/
-bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+uint8
+visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- bool result;
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_get_status %s, block %d", RelationGetRelationName(rel), heapBlk);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,13 +397,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * Reading both bits (a single byte) is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
-
- return result;
+ return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
}
/*
@@ -374,13 +409,17 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * If all_frozen is not NULL, the number of all-frozen pages is counted as well.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, BlockNumber *all_frozen)
{
- BlockNumber result = 0;
BlockNumber mapBlock;
+ BlockNumber all_visible = 0;
+
+ if (all_frozen)
+ *all_frozen = 0;
for (mapBlock = 0;; mapBlock++)
{
@@ -406,13 +445,15 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ all_visible += number_of_ones_for_visible[map[i]];
+ if (all_frozen)
+ *all_frozen += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
}
- return result;
+ return all_visible;
}
/*
@@ -435,7 +476,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ elog(DEBUG1, "vm_truncate %s block %d", RelationGetRelationName(rel), nheapblocks);
#endif
RelationOpenSmgr(rel);
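
The VM_ALL_VISIBLE and VM_ALL_FROZEN tests used throughout the rest of the patch are presumably thin wrappers around visibilitymap_get_status(). A sketch of what they would look like, with flag values inferred from the bit layout above (the actual definitions would live in visibilitymap.h):

/*
 * Sketch only: flag values and convenience macros as assumed from how
 * they are used elsewhere in this patch; not the actual header hunk.
 */
#define VISIBILITYMAP_ALL_VISIBLE	0x01
#define VISIBILITYMAP_ALL_FROZEN	0x02
#define VISIBILITYMAP_VALID_BITS	0x03	/* OR of all valid flag bits */

#define VM_ALL_VISIBLE(rel, blkno, buf) \
	((visibilitymap_get_status((rel), (blkno), (buf)) & VISIBILITYMAP_ALL_VISIBLE) != 0)
#define VM_ALL_FROZEN(rel, blkno, buf) \
	((visibilitymap_get_status((rel), (blkno), (buf)) & VISIBILITYMAP_ALL_FROZEN) != 0)
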
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..1cea026 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1813,7 +1813,7 @@ FormIndexDatum(IndexInfo *indexInfo,
* isprimary: if true, set relhaspkey true; else no change
* reltuples: if >= 0, set reltuples to this value; else no change
*
- * If reltuples >= 0, relpages and relallvisible are also updated (using
+ * If reltuples >= 0, relpages, relallvisible are also updated (using
* RelationGetNumberOfBlocks() and visibilitymap_count()).
*
* NOTE: an important side-effect of this operation is that an SI invalidation
@@ -1921,7 +1921,7 @@ index_update_stats(Relation rel,
BlockNumber relallvisible;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ relallvisible = visibilitymap_count(rel, NULL);
else /* don't bother for indexes */
relallvisible = 0;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ccc030f..8c555eb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -444,6 +444,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(C.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ddb68ab..02a2c68 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -314,6 +314,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ BlockNumber relallvisible,
+ relallfrozen;
if (inh)
ereport(elevel,
@@ -564,6 +566,10 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
}
}
+ /* Count the all-visible and all-frozen pages */
+ if (!inh)
+ relallvisible = visibilitymap_count(onerel, &relallfrozen);
+
/*
* Update pages/tuples stats in pg_class ... but not if we're doing
* inherited stats.
@@ -572,7 +578,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ relallvisible,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -608,7 +614,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
* tracks per-table stats.
*/
if (!inh)
- pgstat_report_analyze(onerel, totalrows, totaldeadrows);
+ pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
if (!(options & VACOPT_VACUUM))
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7c4ef58..0a02a25 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -729,11 +729,11 @@ vac_estimate_reltuples(Relation relation, bool is_analyze,
* marked with xmin = our xid.
*
* In addition to fundamentally nontransactional statistics such as
- * relpages and relallvisible, we try to maintain certain lazily-updated
- * DDL flags such as relhasindex, by clearing them if no longer correct.
- * It's safe to do this in VACUUM, which can't run in parallel with
- * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
- * However, it's *not* safe to do it in an ANALYZE that's within an
+ * relpages and relallvisible, we try to maintain certain lazily-updated
+ * DDL flags such as relhasindex, by clearing them if no longer correct.
+ * It's safe to do this in VACUUM, which can't run in
+ * parallel with CREATE INDEX/RULE/TRIGGER and can't be part of a transaction
+ * block. However, it's *not* safe to do it in an ANALYZE that's within an
* outer transaction, because for example the current transaction might
* have dropped the last index; then we'd think relhasindex should be
* cleared, but if the transaction later rolls back this would be wrong.
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2429889..994efb7 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped thanks to the
all-frozen bit of the visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -221,7 +224,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* We request a full scan if either the table's frozen Xid is now older
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit.
+ * mxid full-table scan limit. Even during a full scan, we can skip pages
+ * whose all-frozen bit is set in the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +257,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -301,10 +306,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = visibilitymap_count(onerel, &new_rel_allfrozen);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -325,7 +333,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),
onerel->rd_rel->relisshared,
new_live_tuples,
- vacrelstats->new_dead_tuples);
+ vacrelstats->new_dead_tuples,
+ new_rel_allfrozen);
/* and log the action if appropriate */
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
@@ -360,10 +369,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +496,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * the visibility map means that we can't update relfrozenxid, so we only
+ * want to do it if we can skip a goodly number of pages. On the other
+ * hand, we count both how many pages we skipped according to the
+ * all-frozen bit and how many pages we froze, so we can still update
+ * relfrozenxid if the sum of the two covers all pages of the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -498,24 +511,24 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we only skip pages that are all-frozen; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
- * Note: The value returned by visibilitymap_test could be slightly
+ * Note: The value returned by visibilitymap_get_status could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*/
for (next_not_all_visible_block = 0;
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -533,9 +546,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples we froze on this page */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -547,8 +564,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -562,14 +578,29 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
skipping_all_visible_blocks = true;
else
skipping_all_visible_blocks = false;
+
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We also check whether this block is all-frozen, so that we can skip
+ * vacuuming it even if scan_all is true.
+ */
+ bool all_frozen = VM_ALL_FROZEN(onerel, blkno, &vmbuffer);
+
+ if (scan_all && all_frozen)
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ else if (!scan_all && skipping_all_visible_blocks)
+ continue;
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -716,7 +747,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
empty_pages++;
freespace = PageGetHeapFreeSpace(page);
- /* empty pages are always all-visible */
+ /* empty pages are always all-visible and all-frozen */
if (!PageIsAllVisible(page))
{
START_CRIT_SECTION();
@@ -739,8 +770,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
log_newpage_buffer(buf, true);
PageSetAllVisible(page);
+ PageSetAllFrozen(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
END_CRIT_SECTION();
}
@@ -764,6 +797,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +953,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +971,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples, mark the buffer dirty, and write a WAL
+ * record recording the changes. We must log the changes to be crash-safe
+ * against future truncation of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1006,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1031,46 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,9 +1081,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
{
- elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+ /* If the all-frozen bit is set then the all-visible bit must be set too */
+ if (all_frozen_according_to_vm)
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page is not marked all-visible (and all-frozen) but visibility map bit(s) is set in relation \"%s\" page %u",
relname, blkno);
visibilitymap_clear(onerel, blkno, vmbuffer);
}
@@ -1028,19 +1096,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/*
* It's possible for the value returned by GetOldestXmin() to move
* backwards, so it's not wrong for us to see tuples that appear to
- * not be visible to everyone yet, while PD_ALL_VISIBLE is already
- * set. The real safe xmin value never moves backwards, but
- * GetOldestXmin() is conservative and sometimes returns a value
+ * not be visible to everyone yet, while PD_ALL_VISIBLE (and PD_ALL_FROZEN)
+ * are already set. The real safe xmin value never moves backwards,
+ * but GetOldestXmin() is conservative and sometimes returns a value
* that's unnecessarily small, so if we see that contradiction it just
* means that the tuples that we think are not visible to everyone yet
- * actually are, and the PD_ALL_VISIBLE flag is correct.
+ * actually are, and the PD_ALL_VISIBLE (and PD_ALL_FROZEN) flags are
+ * correct.
*
* There should never be dead tuples on a page with PD_ALL_VISIBLE
* set, however.
*/
else if (PageIsAllVisible(page) && has_dead_tuples)
{
- elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+ /* If the all-frozen bit is set then the all-visible bit must be set too */
+ if (PageIsAllFrozen(page))
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page containing dead tuples is marked as all-visible (and all-frozen) in relation \"%s\" page %u",
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
@@ -1114,6 +1188,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to visibility map",
+ "skipped %d frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages,
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1230,6 +1311,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1281,19 +1363,34 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set the VM all-frozen bit as well.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
+ uint8 vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer);
+ uint8 flags = 0;
+
+ if (!(vm_status & VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* Add the all-frozen bit to the flags, if needed */
+ if (all_frozen && !(vm_status & VISIBILITYMAP_ALL_FROZEN))
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+
+ if (vm_status != flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1783,10 +1880,12 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and set all_frozen to indicate whether all
+ * tuples of this page are frozen.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1795,6 +1894,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1818,11 +1918,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible or frozen.
*/
if (ItemIdIsDead(itemid))
{
all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1859,6 +1960,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1867,6 +1972,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
@@ -1875,5 +1981,8 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
} /* scan along page */
+ if (!all_visible)
+ *all_frozen = false;
+
return all_visible;
}
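
Putting the bookkeeping together: relfrozenxid can still be advanced as long as every page was either scanned by this VACUUM or skipped because its all-frozen bit was already set. A trivial standalone illustration of that condition (simplified from the check in lazy_vacuum_rel() above; the types and names here are illustrative only):

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int BlockNumber;

/*
 * Simplified illustration of the condition lazy_vacuum_rel() uses to
 * decide whether the table counts as fully scanned, which in turn
 * allows relfrozenxid/relminmxid to be advanced.
 */
static bool
scanned_all_pages(BlockNumber rel_pages,
				  BlockNumber scanned_pages,
				  BlockNumber vmskipped_frozen_pages)
{
	return (scanned_pages + vmskipped_frozen_pages) >= rel_pages;
}

int
main(void)
{
	/* 1000-page table: 980 pages skipped as all-frozen, 20 actually scanned */
	printf("scanned_all = %d\n", scanned_all_pages(1000, 20, 980));
	return 0;
}
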
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..e345177 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -85,7 +85,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
* which all tuples are known visible to everybody. In any case,
* we'll use the index tuple not the heap tuple as the data source.
*
- * Note on Memory Ordering Effects: visibilitymap_test does not lock
+ * Note on Memory Ordering Effects: visibilitymap_get_status does not lock
* the visibility map buffer, and therefore the result we read here
* could be slightly stale. However, it can't be stale enough to
* matter.
@@ -114,9 +114,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
* It's worth going through this complexity to avoid needing to lock
* the VM buffer, which could cause significant contention.
*/
- if (!visibilitymap_test(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ if (!VM_ALL_VISIBLE(scandesc->heapRelation, ItemPointerGetBlockNumber(tid),
+ &node->ioss_VMBuffer))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 990486c..d100a7d 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -468,7 +468,7 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
* estimates based on the correlation squared (XXX is that appropriate?).
*
* If it's an index-only scan, then we will not need to fetch any heap
- * pages for which the visibility map shows all tuples are visible.
+ * pages for which the visibility map shows all tuples are visible.
* Hence, reduce the estimated number of heap fetches accordingly.
* We use the measured fraction of the entire heap that is all-visible,
* which might not be particularly relevant to the subset of the heap
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ab018c4..ca7257a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1329,7 +1329,8 @@ pgstat_report_autovac(Oid dboid)
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgVacuum msg;
@@ -1343,6 +1344,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -1354,7 +1356,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
*/
void
pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgAnalyze msg;
@@ -1394,6 +1397,7 @@ pgstat_report_analyze(Relation rel,
msg.m_analyzetime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -3702,6 +3706,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
+ result->n_frozen_pages = 0;
result->blocks_fetched = 0;
result->blocks_hit = 0;
result->vacuum_timestamp = 0;
@@ -5069,6 +5074,7 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
if (msg->m_autovacuum)
{
@@ -5103,6 +5109,7 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
/*
* We reset changes_since_analyze to zero, forgetting any changes that
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f7c9bf6..98c14f9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -46,6 +46,7 @@ extern Datum pg_stat_get_vacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autovacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_analyze_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_frozen_pages(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_calls(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_total_time(PG_FUNCTION_ARGS);
@@ -450,6 +451,21 @@ pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_frozen_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int32 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int32) (tabentry->n_frozen_pages);
+
+ PG_RETURN_INT32(result);
+}
+
+Datum
pg_stat_get_function_calls(PG_FUNCTION_ARGS)
{
Oid funcid = PG_GETARG_OID(0);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 37eb832..2256719 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,9 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
#include <fcntl.h>
@@ -21,6 +24,43 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static const char *rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool force);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
+
+/*
+ * copyOrRewriteFile()
+ * This function copies a file, or rewrites a visibility map file.
+ * If rewrite_vm is true, we have to rewrite the visibility map regardless of the value of pageConverter.
+ */
+const char *
+copyOrRewriteFile(pageCnvCtx *pageConverter,
+ const char *src, const char *dst, bool force, bool rewrite_vm)
+{
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, force);
+ else
+ return copyAndUpdateFile(pageConverter, src, dst, force);
+}
/*
* copyAndUpdateFile()
@@ -115,12 +155,14 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
*/
const char *
linkAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst)
+ const char *src, const char *dst, bool rewrite_vm)
{
if (pageConverter != NULL)
return "Cannot in-place update this cluster, page-by-page conversion is required";
- if (pg_link_file(src, dst) == -1)
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, true);
+ else if (pg_link_file(src, dst) == -1)
return getErrorText(errno);
else
return NULL;
@@ -205,6 +247,99 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilitymap()
+ *
+ * Copies a visibility map file while adding an all-frozen bit (initially 0) for each heap block.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd = 0;
+ int dst_fd = 0;
+ char buffer[BLCKSZ];
+ ssize_t bytesRead;
+ int rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+ BlockNumber blkno = 0;
+
+ /* Reset errno */
+ errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return getErrorText(EINVAL);
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ /* Perform data rewriting per page */
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
+ char *cur, *end, *blkend;
+ PageHeaderData pageheader;
+ uint16 vm_bits;
+
+ /* Save the page header data */
+ memcpy(&pageheader, buffer, SizeOfPageHeaderData);
+
+ cur = buffer;
+ end = buffer + SizeOfPageHeaderData + rewriteVmBytesPerPage;
+ blkend = buffer + bytesRead;
+
+ while (blkend >= end)
+ {
+ char vmbuf[BLCKSZ];
+ char *vmtmp = vmbuf;
+
+ /* Copy page header in advance */
+ memcpy(vmbuf, &pageheader, SizeOfPageHeaderData);
+
+ cur += SizeOfPageHeaderData;
+ vmtmp += SizeOfPageHeaderData;
+
+ /* Rewrite visibility map bit one by one */
+ while (end > cur)
+ {
+ /* Write the rewritten bits looked up from the table */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+ memcpy(vmtmp, &vm_bits, BITS_PER_HEAPBLOCK);
+
+ cur++;
+ vmtmp += BITS_PER_HEAPBLOCK;
+ }
+
+ /* Set new checksum for a visibility map page, if enabled */
+ if (old_cluster.controldata.data_checksum_version != 0 &&
+ new_cluster.controldata.data_checksum_version != 0)
+ ((PageHeader) vmbuf)->pd_checksum = pg_checksum_page(vmbuf, blkno);
+
+ if (write(dst_fd, vmbuf, BLCKSZ) != BLCKSZ)
+ {
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ end += rewriteVmBytesPerPage;
+ blkno++;
+ }
+ }
+
+err:
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText(errno);
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index fa4661b..5349a21 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -110,6 +110,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map changed with this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201511241
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -394,10 +398,12 @@ const pageCnvCtx *setupPageConverter(void);
typedef void *pageCnvCtx;
#endif
+const char *copyOrRewriteFile(pageCnvCtx *pageConverter, const char *src,
+ const char *dst, bool force, bool rewrite_vm);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst);
+ const char *dst, bool rewrite_vm);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c22df42..b3322e9 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *suffix, bool vm_need_rewrite);
/*
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_need_rewrite = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,13 +181,20 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_need_rewrite = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(pageConverter, &maps[mapnum], "", false);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +202,14 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(pageConverter, &maps[mapnum], "_fsm", false);
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ {
+ if (vm_need_rewrite)
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", true);
+ else
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", false);
+ }
}
}
}
@@ -210,7 +223,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
*/
static void
transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+ const char *type_suffix, bool vm_need_rewrite)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -276,7 +289,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ if ((msg = copyOrRewriteFile(pageConverter, old_file, new_file, true, vm_need_rewrite)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -284,7 +297,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
+ if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file, vm_need_rewrite)) != NULL)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index aa7f399..6b058d4 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..c55d232 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer pim_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index b3b91e7..a200e5e 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -40,6 +40,6 @@ extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other);
+ Buffer *pimbuffer, Buffer *pimbuffer_other);
#endif /* HIO_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..5f032ab 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,28 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
+#define VISIBILITYMAP_VALID_BITS 0x03 /* OR of all valid visibilitymap flag bits */
+
+/* Macros for visibilitymap test */
+#define VM_ALL_VISIBLE(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_VISIBLE) != 0)
+#define VM_ALL_FROZEN(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern BlockNumber visibilitymap_count(Relation rel, BlockNumber *all_frozen);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index eba4150..5dd96f6 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201511071
+#define CATALOG_VERSION_NO 201511241
#endif
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index d8640db..9a77d7d 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2779,6 +2779,8 @@ DATA(insert OID = 3056 ( pg_stat_get_analyze_count PGNSP PGUID 12 1 0 0 0 f f f
DESCR("statistics: number of manual analyzes for a table");
DATA(insert OID = 3057 ( pg_stat_get_autoanalyze_count PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_autoanalyze_count _null_ _null_ _null_ ));
DESCR("statistics: number of auto analyzes for a table");
+DATA(insert OID = 6015 ( pg_stat_get_frozen_pages PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_frozen_pages _null_ _null_ _null_ ));
+DESCR("statistics: number of frozen pages of table");
DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 0 f f f f t t s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_idset _null_ _null_ _null_ ));
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f f f t s r 1 0 2249 "23" "{23,26,23,26,25,25,25,16,1184,1184,1184,1184,869,25,23,28,28,16,25,25,23,16,25}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,pid,usesysid,application_name,state,query,waiting,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,ssl,sslversion,sslcipher,sslbits,sslcompression,sslclientdn}" _null_ _null_ pg_stat_get_activity _null_ _null_ _null_ ));
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index eb3591a..6165500 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1363,7 +1363,7 @@ typedef struct IndexScanState
* RuntimeContext expr context for evaling runtime Skeys
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
- * VMBuffer buffer in use for visibility map testing, if any
+ * PIMBuffer buffer in use for visibility map testing, if any
* HeapFetches number of tuples we were forced to fetch from heap
* ----------------
*/
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9ecc163..614ca5a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -355,6 +355,7 @@ typedef struct PgStat_MsgVacuum
TimestampTz m_vacuumtime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgVacuum;
@@ -372,6 +373,7 @@ typedef struct PgStat_MsgAnalyze
TimestampTz m_analyzetime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgAnalyze;
@@ -551,7 +553,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9E
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -615,6 +617,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter n_dead_tuples;
PgStat_Counter changes_since_analyze;
+ int32 n_frozen_pages;
+
PgStat_Counter blocks_fetched;
PgStat_Counter blocks_hit;
@@ -917,9 +921,11 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type t
extern void pgstat_report_autovac(Oid dboid);
extern void pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_recovery_conflict(int reason);
extern void pgstat_report_deadlock(void);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..102aa81 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -367,7 +369,12 @@ typedef PageHeaderData *PageHeader;
#define PageSetAllVisible(page) \
(((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
#define PageClearAllVisible(page) \
- (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+ (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
+
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
#define PageIsPrunable(page, oldestxmin) \
( \
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 80374e4..b259e65 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1683,6 +1683,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(c.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid) AS last_autovacuum,
pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
@@ -1795,6 +1796,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
@@ -1838,6 +1840,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index f5be70f..dd49786 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -127,6 +127,8 @@ SELECT count(*) FROM tenk2 WHERE unique1 = 1;
1
(1 row)
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
SELECT pg_sleep(1.0);
@@ -175,6 +177,14 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
t | t
(1 row)
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+ ?column?
+----------
+ t
+(1 row)
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
snapshot_newer
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..3be0354
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,22 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 45 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+\set VERBOSITY default
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 3987b4c..5253a29 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -110,3 +110,6 @@ test: event_trigger
# run stats by itself because its delay may be insufficient under heavy load
test: stats
+
+# page info map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 379f272..c5fd695 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -160,3 +160,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index cd2d592..b3c640f 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -120,6 +120,8 @@ ROLLBACK;
SELECT count(*) FROM tenk2;
-- do an indexscan
SELECT count(*) FROM tenk2 WHERE unique1 = 1;
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
@@ -145,6 +147,10 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
WHERE st.relname='tenk2' AND cl.relname='tenk2';
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..365570b
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,16 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+\set VERBOSITY default
+
+DROP TABLE vmtest;
On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Yeah, we need to consider computing the checksum if it is enabled.
I've changed the patch, and attached.
Please review it.
Thanks for the update. This now conflicts with the updates done to
fix the pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
conflict in order to do some testing, but I'd like to get an updated
patch from the author in case I did it wrong. I don't want to find
bugs that I just introduced myself.
Thanks,
Jeff
On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Yeah, we need to consider computing the checksum if it is enabled.
I've changed the patch, and attached.
Please review it.
Thanks for the update. This now conflicts with the updates done to
fix the pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
conflict in order to do some testing, but I'd like to get an updated
patch from the author in case I did it wrong. I don't want to find
bugs that I just introduced myself.
Thank you for having a look.
Attached is the updated v28 patch.
Please review it.
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v28.patchapplication/octet-stream; name=000_add_frozen_bit_into_visibilitymap_v28.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..e8ebfe9 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 392eb70..c43443a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5916,7 +5916,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -5960,7 +5960,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..5a43c28 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and of which pages contain only frozen tuples.
+ This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -438,28 +438,25 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows. To ensure all old
+ row versions have been frozen, a scan of all unfrozen pages is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: whole-table freezing is forced if
+ the table hasn't had all of its row versions guaranteed frozen for
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transactions.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
- that, data loss could result. To ensure that this does not happen,
- autovacuum is invoked on any table that might contain unfrozen rows with
- XIDs older than the age specified by the configuration parameter <xref
- linkend="guc-autovacuum-freeze-max-age">. (This will happen even if
- autovacuum is disabled.)
+ the time <command>VACUUM</> last scanned unfrozen pages. If it were to go
+ unvacuumed for longer than that, data loss could result. To ensure
+ that this does not happen, autovacuum is invoked on any table that might
+ contain unfrozen rows with XIDs older than the age specified by the
+ configuration parameter <xref linkend="guc-autovacuum-freeze-max-age">.
+ (This will happen even if autovacuum is disabled.)
</para>
<para>
@@ -490,8 +487,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +523,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +551,19 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. Whole-table freezing occurs only when all pages happen to
+ require freezing. In other cases, such as where
+ <structfield>relfrozenxid</> is more than <varname>vacuum_freeze_table_age</>
+ transactions old, or where <command>VACUUM</>'s <literal>FREEZE</> option is used,
+ <command>VACUUM</> can skip pages on which all tuples are already
+ marked as frozen.
+ When all pages of the table are eventually marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transactions started since the <command>VACUUM</> started).
+ If <structfield>relfrozenxid</> has not been advanced by the time
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,28 +640,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ <command>VACUUM</> scans of all unfrozen pages, regardless of what causes
+ them, enable advancing the value for that table.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. A
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +741,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e64b7ef..1908a4d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1332,6 +1332,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Estimated number of rows modified since this table was last analyzed</entry>
</row>
<row>
+ <entry><structfield>n_frozen_page</></entry>
+ <entry><type>integer</></entry>
+ <entry>Number of frozen pages</entry>
+ </row>
+ <row>
<entry><structfield>last_vacuum</></entry>
<entry><type>timestamp with time zone</></entry>
<entry>Last time at which this table was manually vacuumed
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index eb113c2..275b69c 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -657,6 +657,12 @@ psql --username postgres --file script.sql postgres
</para>
<para>
+ Since the format of the visibility map changed in version 9.6,
+ <application>pg_upgrade</> creates and rewrites a new <literal>'_vm'</literal>
+ file when upgrading from 9.5 or before to 9.6 or later, even in link mode (-k).
+ </para>
+
+ <para>
All failure, rebuild, and reindex cases will be reported by
<application>pg_upgrade</> if they affect your installation;
post-upgrade scripts to rebuild tables and indexes will be
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..5dc8b04 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -623,18 +623,20 @@ can be used to examine the information stored in free space maps.
<para>
Each heap relation has a Visibility Map
(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
+visible to all active transactions, and of which pages contain only frozen tuples.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_vm</> suffix.
+For example, if the filenode of a relation is 12345, the VM is stored in a file
+called <filename>12345_vm</>, in the same directory as the main relation file.
Note that indexes do not have VMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The visibility map stores two bits per heap page: all-visible and all-frozen.
+A set all-visible bit means that all tuples on the page are known to be visible
+to all transactions. A set all-frozen bit means that all tuples on the page are
+completely frozen. This means that the page does not contain any tuples
+that need to be vacuumed, even when a whole-table scan is required.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9ff7a41..651dd0e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3034,9 +3034,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -7205,7 +7205,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -7215,6 +7215,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7804,7 +7805,12 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7856,7 +7862,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6db73bf..0fe49eb 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -119,7 +119,7 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock,
* be less than buffer2.
*/
static void
-GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
+GetVisibilitymapPins(Relation relation, Buffer buffer1, Buffer buffer2,
BlockNumber block1, BlockNumber block2,
Buffer *vmbuffer1, Buffer *vmbuffer2)
{
@@ -380,11 +380,11 @@ RelationGetBufferForTuple(Relation relation, Size len,
* done.
*/
if (otherBuffer == InvalidBuffer || buffer <= otherBuffer)
- GetVisibilityMapPins(relation, buffer, otherBuffer,
+ GetVisibilitymapPins(relation, buffer, otherBuffer,
targetBlock, otherBlock, vmbuffer,
vmbuffer_other);
else
- GetVisibilityMapPins(relation, otherBuffer, buffer,
+ GetVisibilitymapPins(relation, otherBuffer, buffer,
otherBlock, targetBlock, vmbuffer_other,
vmbuffer);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..387a0d6 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -15,39 +15,46 @@
* visibilitymap_pin - pin a map page for setting a bit
* visibilitymap_pin_ok - check whether correct map page is already pinned
* visibilitymap_set - set a bit in a previously pinned page
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
* visibilitymap_count - count number of bits set in visibility map
* visibilitymap_truncate - truncate the visibility map
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if a whole-table-scanning vacuum is required (e.g. anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. The all-frozen bit must be
+ * cleared together with the all-visible bit.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is all-visible or
+ * all-frozen; we just don't know that for certain. The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility map
+ * bit. If a crash occurs after the visibility map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
+ * The visibility map has an all-frozen bit, which indicates that all tuples on
+ * the corresponding page have been completely frozen, so the visibility map is
+ * also used for anti-wraparound vacuums, even when freezing of tuples is required.
*
* LOCKING
*
@@ -58,14 +65,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +108,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+*/
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -113,26 +123,44 @@
/* Mapping from heap block number to the right bit in the visibility map */
#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
-#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
+
+/* tables for fast counting of set bits for all-visible and all-frozen */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +169,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,11 +181,11 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = VISIBILITYMAP_VALID_BITS << mapBit;
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s block %d", RelationGetRelationName(rel), heapBlk);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -186,7 +214,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
* visibilitymap_set to actually set the bit.
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk.
*
@@ -212,7 +240,7 @@ visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
* visibilitymap_pin_ok - do we already have the correct page pinned?
*
* On entry, buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. The return value indicates whether the buffer covers the
* given heapBlk.
*/
@@ -225,7 +253,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +262,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and the flags indicating which bits we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +274,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,11 +284,12 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
Assert(InRecovery || BufferIsValid(heapBuf));
+ Assert(flags & VISIBILITYMAP_VALID_BITS);
/* Check that we have the right heap page pinned, if present */
if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -272,11 +303,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << mapBit);
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +316,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +326,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(((flags & VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
+ ((flags & VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +346,17 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible to all, or all frozen, according
+ * to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
- * earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits. The caller can
+ * mask the returned status with the VISIBILITYMAP_* flags it is interested in.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -327,17 +365,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* we might see the old value. It is the caller's responsibility to deal with
* all concurrency issues!
*/
-bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+uint8
+visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- bool result;
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_get_status %s, block %d", RelationGetRelationName(rel), heapBlk);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,13 +397,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * The two-bit read is atomic (both bits are in the same byte). There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
-
- return result;
+ return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
}
/*
@@ -374,13 +409,17 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * If all_frozen is not NULL, the number of all-frozen pages is also counted and returned through it.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, BlockNumber *all_frozen)
{
- BlockNumber result = 0;
BlockNumber mapBlock;
+ BlockNumber all_visible = 0;
+
+ if (all_frozen)
+ *all_frozen = 0;
for (mapBlock = 0;; mapBlock++)
{
@@ -406,13 +445,15 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ all_visible += number_of_ones_for_visible[map[i]];
+ if (all_frozen)
+ *all_frozen += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
}
- return result;
+ return all_visible;
}
/*
@@ -435,7 +476,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ elog(DEBUG1, "vm_truncate %s block %d", RelationGetRelationName(rel), nheapblocks);
#endif
RelationOpenSmgr(rel);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..1cea026 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1813,7 +1813,7 @@ FormIndexDatum(IndexInfo *indexInfo,
* isprimary: if true, set relhaspkey true; else no change
* reltuples: if >= 0, set reltuples to this value; else no change
*
- * If reltuples >= 0, relpages and relallvisible are also updated (using
+ * If reltuples >= 0, relpages, relallvisible are also updated (using
* RelationGetNumberOfBlocks() and visibilitymap_count()).
*
* NOTE: an important side-effect of this operation is that an SI invalidation
@@ -1921,7 +1921,7 @@ index_update_stats(Relation rel,
BlockNumber relallvisible;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ relallvisible = visibilitymap_count(rel, NULL);
else /* don't bother for indexes */
relallvisible = 0;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ccc030f..8c555eb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -444,6 +444,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(C.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ddb68ab..02a2c68 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -314,6 +314,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ BlockNumber relallvisible,
+ relallfrozen;
if (inh)
ereport(elevel,
@@ -564,6 +566,10 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
}
}
+ /* Count the all-visible and all-frozen pages in the visibility map */
+ if (!inh)
+ relallvisible = visibilitymap_count(onerel, &relallfrozen);
+
/*
* Update pages/tuples stats in pg_class ... but not if we're doing
* inherited stats.
@@ -572,7 +578,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ relallvisible,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -608,7 +614,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
* tracks per-table stats.
*/
if (!inh)
- pgstat_report_analyze(onerel, totalrows, totaldeadrows);
+ pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
if (!(options & VACOPT_VACUUM))
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7c4ef58..0a02a25 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -729,11 +729,11 @@ vac_estimate_reltuples(Relation relation, bool is_analyze,
* marked with xmin = our xid.
*
* In addition to fundamentally nontransactional statistics such as
- * relpages and relallvisible, we try to maintain certain lazily-updated
- * DDL flags such as relhasindex, by clearing them if no longer correct.
- * It's safe to do this in VACUUM, which can't run in parallel with
- * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
- * However, it's *not* safe to do it in an ANALYZE that's within an
+ * relpages and relallvisible, we try to maintain certain lazily-updated
+ * DDL flags such as relhasindex, by clearing them if no longer correct.
+ * It's safe to do this in VACUUM, which can't run in
+ * parallel with CREATE INDEX/RULE/TRIGGER and can't be part of a transaction
+ * block. However, it's *not* safe to do it in an ANALYZE that's within an
* outer transaction, because for example the current transaction might
* have dropped the last index; then we'd think relhasindex should be
* cleared, but if the transaction later rolls back this would be wrong.
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2429889..994efb7 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped due to the
+ all-frozen bit of the visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -221,7 +224,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* We request a full scan if either the table's frozen Xid is now older
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit.
+ * mxid full-table scan limit. Even during a full scan, we can skip pages
+ * that the visibility map shows as all-frozen.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +257,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -301,10 +306,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = visibilitymap_count(onerel, &new_rel_allfrozen);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -325,7 +333,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),
onerel->rd_rel->relisshared,
new_live_tuples,
- vacrelstats->new_dead_tuples);
+ vacrelstats->new_dead_tuples,
+ new_rel_allfrozen);
/* and log the action if appropriate */
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
@@ -360,10 +369,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +496,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * the visibility map means that we can't update relfrozenxid, so we only
+ * want to do it if we can skip a goodly number of pages. On the other
+ * hand, we count both the pages skipped according to the all-frozen bit
+ * and the pages we freeze, so we can still update relfrozenxid when the
+ * sum of the two covers the whole table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -498,24 +511,24 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we only skip pages that are all-frozen; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
- * Note: The value returned by visibilitymap_test could be slightly
+ * Note: The value returned by visibilitymap_get_status could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*/
for (next_not_all_visible_block = 0;
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -533,9 +546,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples frozen by this vacuum */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page; /* # of tuples on this page */
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -547,8 +564,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -562,14 +578,29 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
skipping_all_visible_blocks = true;
else
skipping_all_visible_blocks = false;
+
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We check whether this block is all-frozen, so that we can skip
+ * vacuuming it even if scan_all is true.
+ */
+ bool all_frozen = VM_ALL_FROZEN(onerel, blkno, &vmbuffer);
+
+ if (scan_all && all_frozen)
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ else if (!scan_all && skipping_all_visible_blocks)
+ continue;
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -716,7 +747,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
empty_pages++;
freespace = PageGetHeapFreeSpace(page);
- /* empty pages are always all-visible */
+ /* empty pages are always all-visible and all-frozen */
if (!PageIsAllVisible(page))
{
START_CRIT_SECTION();
@@ -739,8 +770,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
log_newpage_buffer(buf, true);
PageSetAllVisible(page);
+ PageSetAllFrozen(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
END_CRIT_SECTION();
}
@@ -764,6 +797,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +953,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +971,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we froze any tuples, mark the buffer dirty, and write a WAL
+ * record recording the changes. We must log the changes to be crash-safe
+ * against future truncation of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1006,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1031,46 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,9 +1081,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
{
- elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+ /* If all-frozen is set then all-visible must be set too */
+ if (all_frozen_according_to_vm)
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page is not marked all-visible (and all-frozen) but visibility map bit(s) is set in relation \"%s\" page %u",
relname, blkno);
visibilitymap_clear(onerel, blkno, vmbuffer);
}
@@ -1028,19 +1096,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/*
* It's possible for the value returned by GetOldestXmin() to move
* backwards, so it's not wrong for us to see tuples that appear to
- * not be visible to everyone yet, while PD_ALL_VISIBLE is already
- * set. The real safe xmin value never moves backwards, but
- * GetOldestXmin() is conservative and sometimes returns a value
+ * not be visible to everyone yet, while PD_ALL_VISIBLE (and PD_ALL_FROZEN)
+ * are already set. The real safe xmin value never moves backwards,
+ * but GetOldestXmin() is conservative and sometimes returns a value
* that's unnecessarily small, so if we see that contradiction it just
* means that the tuples that we think are not visible to everyone yet
- * actually are, and the PD_ALL_VISIBLE flag is correct.
+ * actually are, and the PD_ALL_VISIBLE (and PD_ALL_FROZEN) flags are
+ * correct.
*
* There should never be dead tuples on a page with PD_ALL_VISIBLE
* set, however.
*/
else if (PageIsAllVisible(page) && has_dead_tuples)
{
- elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+ /* If all-frozen is set then all-visible must be set too */
+ if (PageIsAllFrozen(page))
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page containing dead tuples is marked as all-visible (and all-frozen) in relation \"%s\" page %u",
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
@@ -1114,6 +1188,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to visibility map",
+ "skipped %d frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages,
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1230,6 +1311,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1281,19 +1363,34 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if the page is all-frozen, set the page-level PD_ALL_FROZEN flag and the VM all-frozen bit.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
+ uint8 vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer);
+ uint8 flags = 0;
+
+ if (!(vm_status & VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* Also set the VM all-frozen bit, if needed */
+ if (all_frozen && !(vm_status & VISIBILITYMAP_ALL_FROZEN))
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+
+ if (vm_status != flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1783,10 +1880,12 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples. Also sets *all_frozen to indicate whether
+ * every tuple on the page is frozen.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1795,6 +1894,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1818,11 +1918,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible or frozen.
*/
if (ItemIdIsDead(itemid))
{
all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1859,6 +1960,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1867,6 +1972,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
@@ -1875,5 +1981,8 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
} /* scan along page */
+ if (!all_visible)
+ *all_frozen = false;
+
return all_visible;
}
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..e345177 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -85,7 +85,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
* which all tuples are known visible to everybody. In any case,
* we'll use the index tuple not the heap tuple as the data source.
*
- * Note on Memory Ordering Effects: visibilitymap_test does not lock
+ * Note on Memory Ordering Effects: visibilitymap_get_status does not lock
* the visibility map buffer, and therefore the result we read here
* could be slightly stale. However, it can't be stale enough to
* matter.
@@ -114,9 +114,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
* It's worth going through this complexity to avoid needing to lock
* the VM buffer, which could cause significant contention.
*/
- if (!visibilitymap_test(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ if (!VM_ALL_VISIBLE(scandesc->heapRelation, ItemPointerGetBlockNumber(tid),
+ &node->ioss_VMBuffer))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 990486c..d100a7d 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -468,7 +468,7 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
* estimates based on the correlation squared (XXX is that appropriate?).
*
* If it's an index-only scan, then we will not need to fetch any heap
- * pages for which the visibility map shows all tuples are visible.
+ * pages for which the visibility map shows all tuples are visible.
* Hence, reduce the estimated number of heap fetches accordingly.
* We use the measured fraction of the entire heap that is all-visible,
* which might not be particularly relevant to the subset of the heap
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ab018c4..ca7257a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1329,7 +1329,8 @@ pgstat_report_autovac(Oid dboid)
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgVacuum msg;
@@ -1343,6 +1344,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -1354,7 +1356,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
*/
void
pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgAnalyze msg;
@@ -1394,6 +1397,7 @@ pgstat_report_analyze(Relation rel,
msg.m_analyzetime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -3702,6 +3706,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
+ result->n_frozen_pages = 0;
result->blocks_fetched = 0;
result->blocks_hit = 0;
result->vacuum_timestamp = 0;
@@ -5069,6 +5074,7 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
if (msg->m_autovacuum)
{
@@ -5103,6 +5109,7 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
/*
* We reset changes_since_analyze to zero, forgetting any changes that
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f7c9bf6..98c14f9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -46,6 +46,7 @@ extern Datum pg_stat_get_vacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autovacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_analyze_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_frozen_pages(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_calls(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_total_time(PG_FUNCTION_ARGS);
@@ -450,6 +451,21 @@ pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_frozen_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int32 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int32) (tabentry->n_frozen_pages);
+
+ PG_RETURN_INT32(result);
+}
+
+Datum
pg_stat_get_function_calls(PG_FUNCTION_ARGS)
{
Oid funcid = PG_GETARG_OID(0);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index c84783c..349fd2b 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,9 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
#include <fcntl.h>
@@ -21,6 +24,43 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static const char *rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool force);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
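+
The table above can be read as follows: each byte of an old-format map (one all-visible bit per heap block) expands to a 16-bit value in which old bit i moves to bit 2*i and every new all-frozen bit (2*i + 1) stays cleared. A small generator along these lines (a sketch for illustration, not part of the patch) reproduces the entries:

/* Hypothetical generator reproducing rewrite_vm_table. */
#include <stdio.h>

int
main(void)
{
    int     b;

    for (b = 0; b < 256; b++)
    {
        unsigned int    spread = 0;
        int     i;

        for (i = 0; i < 8; i++)
        {
            if (b & (1 << i))
                spread |= 1u << (2 * i);    /* old all-visible bit i -> new bit 2*i */
        }
        printf("%5u,%s", spread, (b % 16 == 15) ? "\n" : " ");
    }
    return 0;
}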
+
+/*
+ * copyOrRewriteFile()
+ * This function copies file or rewrite visibility map file.
+ * If rewrite_vm is true, we have to rewrite visibility map regardless value of pageConverter.
+ */
+const char *
+copyOrRewriteFile(pageCnvCtx *pageConverter,
+ const char *src, const char *dst, bool force, bool rewrite_vm)
+{
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, force);
+ else
+ return copyAndUpdateFile(pageConverter, src, dst, force);
+}
/*
* copyAndUpdateFile()
@@ -115,12 +155,14 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
*/
const char *
linkAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst)
+ const char *src, const char *dst, bool rewrite_vm)
{
if (pageConverter != NULL)
return "Cannot in-place update this cluster, page-by-page conversion is required";
- if (pg_link_file(src, dst) == -1)
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, true);
+ else if (pg_link_file(src, dst) == -1)
return getErrorText();
else
return NULL;
@@ -205,6 +247,99 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilitymap()
+ *
+ * Copies a visibility map file, inserting a cleared all-frozen bit after each all-visible bit.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd = 0;
+ int dst_fd = 0;
+ char buffer[BLCKSZ];
+ ssize_t bytesRead;
+ int rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+ BlockNumber blkno = 0;
+
+ /* Reset errno */
+ errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return getErrorText();
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ /* Perform data rewriting per page */
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
+ char *cur, *end, *blkend;
+ PageHeaderData pageheader;
+ uint16 vm_bits;
+
+ /* Save the page header data */
+ memcpy(&pageheader, buffer, SizeOfPageHeaderData);
+
+ cur = buffer;
+ end = buffer + SizeOfPageHeaderData + rewriteVmBytesPerPage;
+ blkend = buffer + bytesRead;
+
+ while (blkend >= end)
+ {
+ char vmbuf[BLCKSZ];
+ char *vmtmp = vmbuf;
+
+ /* Copy page header in advance */
+ memcpy(vmbuf, &pageheader, SizeOfPageHeaderData);
+
+ cur += SizeOfPageHeaderData;
+ vmtmp += SizeOfPageHeaderData;
+
+ /* Rewrite visibility map bit one by one */
+ while (end > cur)
+ {
+ /* Write the rewritten bits looked up from the table */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+ memcpy(vmtmp, &vm_bits, BITS_PER_HEAPBLOCK);
+
+ cur++;
+ vmtmp += BITS_PER_HEAPBLOCK;
+ }
+
+ /* Set a new checksum for the visibility map page, if enabled */
+ if (old_cluster.controldata.data_checksum_version != 0 &&
+ new_cluster.controldata.data_checksum_version != 0)
+ ((PageHeader) vmbuf)->pd_checksum = pg_checksum_page(vmbuf, blkno);
+
+ if (write(dst_fd, vmbuf, BLCKSZ) != BLCKSZ)
+ {
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ end += rewriteVmBytesPerPage;
+ blkno++;
+ }
+ }
+
+err:
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText();
+}
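+
To make the page accounting in the loop above concrete, here is a back-of-the-envelope check (a sketch; 8192 and 24 are the usual BLCKSZ and page header size, assumed here purely for illustration). Since every old map byte doubles in size, each old visibility map page is rewritten into exactly two new pages:

/* Worked example of the per-page expansion performed by rewriteVisibilitymap(). */
#include <stdio.h>

int
main(void)
{
    const int   blcksz = 8192;
    const int   page_header = 24;
    int         old_map_bytes = blcksz - page_header;      /* map bytes in one old page: 8168 */
    int         bytes_per_new_page = old_map_bytes / 2;    /* old bytes consumed per new page: 4084 */

    printf("one old vm page -> %d new vm pages\n",
           old_map_bytes / bytes_per_new_page);             /* prints 2 */
    return 0;
}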
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index a43dff5..9b20064 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -110,6 +110,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map changed with this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201511241
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -394,10 +398,12 @@ const pageCnvCtx *setupPageConverter(void);
typedef void *pageCnvCtx;
#endif
+const char *copyOrRewriteFile(pageCnvCtx *pageConverter, const char *src,
+ const char *dst, bool force, bool rewrite_vm);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst);
+ const char *dst, bool rewrite_vm);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index bfde1b1..5992bda 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *suffix, bool vm_need_rewrite);
/*
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_need_rewrite = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,13 +181,20 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_need_rewrite = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(pageConverter, &maps[mapnum], "", false);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +202,14 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(pageConverter, &maps[mapnum], "_fsm", false);
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ {
+ if (vm_need_rewrite)
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", true);
+ else
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", false);
+ }
}
}
}
@@ -210,7 +223,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
*/
static void
transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+ const char *type_suffix, bool vm_need_rewrite)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -276,7 +289,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ if ((msg = copyOrRewriteFile(pageConverter, old_file, new_file, true, vm_need_rewrite)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -284,7 +297,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
+ if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file, vm_need_rewrite)) != NULL)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index aa7f399..6b058d4 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..c55d232 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index b3b91e7..a200e5e 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -40,6 +40,6 @@ extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other);
+ Buffer *pimbuffer, Buffer *pimbuffer_other);
#endif /* HIO_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..5f032ab 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,28 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
+#define VISIBILITYMAP_VALID_BITS 0x03 /* OR of all valid visibilitymap flag bits */
+
+/* Macros for visibilitymap test */
+#define VM_ALL_VISIBLE(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_VISIBLE) != 0)
+#define VM_ALL_FROZEN(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern BlockNumber visibilitymap_count(Relation rel, BlockNumber *all_frozen);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
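
A minimal sketch of how a backend-side caller could use the new two-bit interface (hypothetical helper, not part of the patch; it only illustrates the intended relationship between the two bits):

/* Hypothetical helper: returns true if the visibility map says every
 * tuple on blkno is frozen. */
#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"

static bool
block_is_all_frozen(Relation rel, BlockNumber blkno)
{
    Buffer  vmbuffer = InvalidBuffer;
    uint8   status = visibilitymap_get_status(rel, blkno, &vmbuffer);
    bool    frozen = (status & VISIBILITYMAP_ALL_FROZEN) != 0;

    /* all-frozen is only expected together with all-visible */
    Assert(!frozen || (status & VISIBILITYMAP_ALL_VISIBLE) != 0);

    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);
    return frozen;
}

The same test can also be written with the convenience macro, e.g. VM_ALL_FROZEN(rel, blkno, &vmbuffer).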
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index eba4150..5dd96f6 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201511071
+#define CATALOG_VERSION_NO 201511241
#endif
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index d8640db..9a77d7d 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2779,6 +2779,8 @@ DATA(insert OID = 3056 ( pg_stat_get_analyze_count PGNSP PGUID 12 1 0 0 0 f f f
DESCR("statistics: number of manual analyzes for a table");
DATA(insert OID = 3057 ( pg_stat_get_autoanalyze_count PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_autoanalyze_count _null_ _null_ _null_ ));
DESCR("statistics: number of auto analyzes for a table");
+DATA(insert OID = 6015 ( pg_stat_get_frozen_pages PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_frozen_pages _null_ _null_ _null_ ));
+DESCR("statistics: number of frozen pages of table");
DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 0 f f f f t t s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_idset _null_ _null_ _null_ ));
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f f f t s r 1 0 2249 "23" "{23,26,23,26,25,25,25,16,1184,1184,1184,1184,869,25,23,28,28,16,25,25,23,16,25}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,pid,usesysid,application_name,state,query,waiting,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,ssl,sslversion,sslcipher,sslbits,sslcompression,sslclientdn}" _null_ _null_ pg_stat_get_activity _null_ _null_ _null_ ));
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index eb3591a..6165500 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1363,7 +1363,7 @@ typedef struct IndexScanState
* RuntimeContext expr context for evaling runtime Skeys
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
- * VMBuffer buffer in use for visibility map testing, if any
+ * PIMBuffer buffer in use for visibility map testing, if any
* HeapFetches number of tuples we were forced to fetch from heap
* ----------------
*/
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9ecc163..614ca5a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -355,6 +355,7 @@ typedef struct PgStat_MsgVacuum
TimestampTz m_vacuumtime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgVacuum;
@@ -372,6 +373,7 @@ typedef struct PgStat_MsgAnalyze
TimestampTz m_analyzetime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgAnalyze;
@@ -551,7 +553,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9E
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -615,6 +617,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter n_dead_tuples;
PgStat_Counter changes_since_analyze;
+ int32 n_frozen_pages;
+
PgStat_Counter blocks_fetched;
PgStat_Counter blocks_hit;
@@ -917,9 +921,11 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type t
extern void pgstat_report_autovac(Oid dboid);
extern void pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_recovery_conflict(int reason);
extern void pgstat_report_deadlock(void);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..102aa81 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -367,7 +369,12 @@ typedef PageHeaderData *PageHeader;
#define PageSetAllVisible(page) \
(((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
#define PageClearAllVisible(page) \
- (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+ (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
+
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
#define PageIsPrunable(page, oldestxmin) \
( \
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 80374e4..b259e65 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1683,6 +1683,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(c.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid) AS last_autovacuum,
pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
@@ -1795,6 +1796,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
@@ -1838,6 +1840,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index f5be70f..dd49786 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -127,6 +127,8 @@ SELECT count(*) FROM tenk2 WHERE unique1 = 1;
1
(1 row)
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
SELECT pg_sleep(1.0);
@@ -175,6 +177,14 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
t | t
(1 row)
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+ ?column?
+----------
+ t
+(1 row)
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
snapshot_newer
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..3be0354
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,22 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 45 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+\set VERBOSITY default
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index b1bc7c7..e31fa76 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -110,3 +110,6 @@ test: event_trigger
# run stats by itself because its delay may be insufficient under heavy load
test: stats
+
+# the visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index ade9ef1..666e40c 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -161,3 +161,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index cd2d592..b3c640f 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -120,6 +120,8 @@ ROLLBACK;
SELECT count(*) FROM tenk2;
-- do an indexscan
SELECT count(*) FROM tenk2 WHERE unique1 = 1;
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
@@ -145,6 +147,10 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
WHERE st.relname='tenk2' AND cl.relname='tenk2';
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..365570b
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,16 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+\set VERBOSITY default
+
+DROP TABLE vmtest;
On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Yeah, we need to consider to compute checksum if enabled.
I've changed the patch, and attached.
Please review it.
Thanks for the update. This now conflicts with the updates done to
fix the pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
conflict in order to do some testing, but I'd like to get an updated
patch from the author in case I did it wrong. I don't want to find
bugs that I just introduced myself.
Thank you for having a look.
I would not bother mentioning this detail in the pg_upgrade manual page:
+ Since the format of visibility map has been changed in version 9.6,
+ <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
+ file even if upgrading from 9.5 or before to 9.6 or later with link mode (-k).
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +
On 2015-11-30 12:58:43 -0500, Bruce Momjian wrote:
I would not bother mentioning this detail in the pg_upgrade manual page:
+ Since the format of visibility map has been changed in version 9.6,
+ <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
+ file even if upgrading from 9.5 or before to 9.6 or later with link mode (-k).
Might be worthwhile to keep as that influences the runtime for link mode
when migrating <9.6 -> 9.6.
On Mon, Nov 30, 2015 at 07:05:21PM +0100, Andres Freund wrote:
On 2015-11-30 12:58:43 -0500, Bruce Momjian wrote:
I would not bother mentioning this detail in the pg_upgrade manual page:
+ Since the format of visibility map has been changed in version 9.6,
+ <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
+ file even if upgrading from 9.5 or before to 9.6 or later with link mode (-k).
Might be worthwhile to keep as that influences the runtime for link mode
when migrating <9.6 -> 9.6.
It is hard to see that it would have a measurable duration. The
pg_upgrade docs are already very long and this detail doesn't seem
significant. Can someone test the overhead?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +
On Mon, Nov 30, 2015 at 9:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Yeah, we need to consider to compute checksum if enabled.
I've changed the patch, and attached.
Please review it.
Thanks for the update. This now conflicts with the updates done to
fix the pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
conflict in order to do some testing, but I'd like to get an updated
patch from the author in case I did it wrong. I don't want to find
bugs that I just introduced myself.
Thank you for having a look.
Attached updated v28 patch.
Please review it.
Regards,
After running pg_upgrade, if I manually vacuum a table I start getting warnings:
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32756
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32756
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32757
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32757
The warnings are right where the blocks would start using the 2nd page
of the _vm, so I think the problem is there. And looking at the code,
I think that "cur += SizeOfPageHeaderData;" in the inner loop cannot
be correct. We can't skip a header in the current (old) block each
time we reach the end of the new block. The thing we are skipping in
the current block is half the time not a header, but the data at the
halfway point through the block.
Cheers,
Jeff
On Tue, Dec 1, 2015 at 3:04 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Nov 30, 2015 at 9:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Yeah, we need to consider to compute checksum if enabled.
I've changed the patch, and attached.
Please review it.
Thanks for the update. This now conflicts with the updates done to
fix the pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
conflict in order to do some testing, but I'd like to get an updated
patch from the author in case I did it wrong. I don't want to find
bugs that I just introduced myself.
Thank you for having a look.
Attached updated v28 patch.
Please review it.
Regards,
After running pg_upgrade, if I manually vacuum a table I start getting warnings:
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32756
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32756
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32757
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32757The warnings are right where the blocks would start using the 2nd page
of the _vm, so I think the problem is there. And looking at the code,
I think that "cur += SizeOfPageHeaderData;" in the inner loop cannot
be correct. We can't skip a header in the current (old) block each
time we reach the end of the new block. The thing we are skipping in
the current block is half the time not a header, but the data at the
halfway point through the block.
Thank you for reviewing.
You're right, it's not necessary.
Attached is the latest v29 patch, which removes the mention from the pg_upgrade documentation.
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v29.patchapplication/octet-stream; name=000_add_frozen_bit_into_visibilitymap_v29.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..e8ebfe9 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 392eb70..c43443a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5916,7 +5916,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -5960,7 +5960,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..5a43c28 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and which pages contain only frozen tuples.
+ This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -438,28 +438,25 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows. To ensure all old
+ row versions have been frozen, a scan of all unfrozen pages is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: whole-table freezing is forced if
+ the table has not had all of its row versions frozen for
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transactions.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
- that, data loss could result. To ensure that this does not happen,
- autovacuum is invoked on any table that might contain unfrozen rows with
- XIDs older than the age specified by the configuration parameter <xref
- linkend="guc-autovacuum-freeze-max-age">. (This will happen even if
- autovacuum is disabled.)
+ the time <command>VACUUM</> last scanned unfrozen pages. If it were to go
+ unvacuumed for longer than that, data loss could result. To ensure
+ that this does not happen, autovacuum is invoked on any table that might
+ contain unfrozen rows with XIDs older than the age specified by the
+ configuration parameter <xref linkend="guc-autovacuum-freeze-max-age">.
+ (This will happen even if autovacuum is disabled.)
</para>
<para>
@@ -490,8 +487,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +523,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +551,19 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. Whole-table freezing occurs only when all pages happen to
+ require freezing. In other cases, such as when
+ <structfield>relfrozenxid</> is more than <varname>vacuum_freeze_table_age</>
+ transactions old or when <command>VACUUM</>'s <literal>FREEZE</> option is used,
+ <command>VACUUM</> can skip pages on which all tuples are already
+ marked as frozen.
+ When all pages of the table are eventually marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transactions started since the <command>VACUUM</> started).
+ If <structfield>relfrozenxid</> has not been advanced by the time
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,28 +640,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ When <command>VACUUM</> scans all unfrozen pages, regardless of what causes
+ it to do so, that enables advancing the value for that table.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. A
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +741,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e64b7ef..1908a4d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1332,6 +1332,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Estimated number of rows modified since this table was last analyzed</entry>
</row>
<row>
+ <entry><structfield>n_frozen_page</></entry>
+ <entry><type>integer</></entry>
+ <entry>Number of frozen pages</entry>
+ </row>
+ <row>
<entry><structfield>last_vacuum</></entry>
<entry><type>timestamp with time zone</></entry>
<entry>Last time at which this table was manually vacuumed
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..5dc8b04 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -623,18 +623,20 @@ can be used to examine the information stored in free space maps.
<para>
Each heap relation has a Visibility Map
(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
+visible to all active transactions, and which pages contain only frozen tuples.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_vm</> suffix.
+For example, if the filenode of a relation is 12345, the VM is stored in a file
+called <filename>12345_vm</>, in the same directory as the main relation file.
Note that indexes do not have VMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The visibility map stores two bits per heap page: all-visible and all-frozen.
+A set all-visible bit means that all tuples on the page are known to be visible
+to all transactions. A set all-frozen bit means that all tuples on the page are
+completely frozen. This means that the page does not contain any tuples that
+need to be vacuumed, even if a whole-table scan is required.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9ff7a41..651dd0e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3034,9 +3034,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -7205,7 +7205,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -7215,6 +7215,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7804,7 +7805,12 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7856,7 +7862,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6db73bf..0fe49eb 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -119,7 +119,7 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock,
* be less than buffer2.
*/
static void
-GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
+GetVisibilitymapPins(Relation relation, Buffer buffer1, Buffer buffer2,
BlockNumber block1, BlockNumber block2,
Buffer *vmbuffer1, Buffer *vmbuffer2)
{
@@ -380,11 +380,11 @@ RelationGetBufferForTuple(Relation relation, Size len,
* done.
*/
if (otherBuffer == InvalidBuffer || buffer <= otherBuffer)
- GetVisibilityMapPins(relation, buffer, otherBuffer,
+ GetVisibilitymapPins(relation, buffer, otherBuffer,
targetBlock, otherBlock, vmbuffer,
vmbuffer_other);
else
- GetVisibilityMapPins(relation, otherBuffer, buffer,
+ GetVisibilitymapPins(relation, otherBuffer, buffer,
otherBlock, targetBlock, vmbuffer_other,
vmbuffer);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..387a0d6 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -15,39 +15,46 @@
* visibilitymap_pin - pin a map page for setting a bit
* visibilitymap_pin_ok - check whether correct map page is already pinned
* visibilitymap_set - set a bit in a previously pinned page
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
* visibilitymap_count - count number of bits set in visibility map
* visibilitymap_truncate - truncate the visibility map
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if a whole-table-scanning vacuum is required (e.g. an anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. The all-frozen bit must be
+ * cleared together with the all-visible bit.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is all-visible or
+ * all-frozen; we just don't know that for certain. The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility map
+ * bit. If a crash occurs after the visibility map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
+ * The visibility map also has an all-frozen bit, which indicates that all tuples
+ * on the corresponding page have been completely frozen, so the visibility map is
+ * also used for anti-wraparound vacuums, even when freezing of tuples is required.
*
* LOCKING
*
@@ -58,14 +65,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,11 +108,14 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, other for all-frozen.
+*/
+#define BITS_PER_HEAPBLOCK 2
/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+#define HEAPBLOCKS_PER_BYTE 4
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
@@ -113,26 +123,44 @@
/* Mapping from heap block number to the right bit in the visibility map */
#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
-#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
+
+/* tables for fast counting of set bits for visible and freeze */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +169,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,11 +181,11 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = VISIBILITYMAP_VALID_BITS << mapBit;
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s block %d", RelationGetRelationName(rel), heapBlk);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -186,7 +214,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
* visibilitymap_set to actually set the bit.
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk.
*
@@ -212,7 +240,7 @@ visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
* visibilitymap_pin_ok - do we already have the correct page pinned?
*
* On entry, buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. The return value indicates whether the buffer covers the
* given heapBlk.
*/
@@ -225,7 +253,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +262,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and flags which indicates what flag we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,7 +274,8 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
@@ -254,11 +284,12 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
Assert(InRecovery || BufferIsValid(heapBuf));
+ Assert(flags & VISIBILITYMAP_VALID_BITS);
/* Check that we have the right heap page pinned, if present */
if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -272,11 +303,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] & (flags << mapBit)))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << mapBit);
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +316,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +326,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(((flags | VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
+ ((flags | VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +346,17 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible to all or are marked frozen, according
+ * to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
- * earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must pass flags
+ * for which it needs to check the value in visibility map.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -327,17 +365,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* we might see the old value. It is the caller's responsibility to deal with
* all concurrency issues!
*/
-bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+uint8
+visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- bool result;
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_get_status %s, block %d", RelationGetRelationName(rel), heapBlk);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,13 +397,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * The two-bit read is atomic, since both bits live in the same byte. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
-
- return result;
+ return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
}
/*
@@ -374,13 +409,17 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * If a non-NULL all_frozen pointer is passed, the number of all-frozen pages is also returned through it.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, BlockNumber *all_frozen)
{
- BlockNumber result = 0;
BlockNumber mapBlock;
+ BlockNumber all_visible = 0;
+
+ if (all_frozen)
+ *all_frozen = 0;
for (mapBlock = 0;; mapBlock++)
{
@@ -406,13 +445,15 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ all_visible += number_of_ones_for_visible[map[i]];
+ if (all_frozen)
+ *all_frozen += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
}
- return result;
+ return all_visible;
}
/*
@@ -435,7 +476,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ elog(DEBUG1, "vm_truncate %s block %d", RelationGetRelationName(rel), nheapblocks);
#endif
RelationOpenSmgr(rel);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..1cea026 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1813,7 +1813,7 @@ FormIndexDatum(IndexInfo *indexInfo,
* isprimary: if true, set relhaspkey true; else no change
* reltuples: if >= 0, set reltuples to this value; else no change
*
- * If reltuples >= 0, relpages and relallvisible are also updated (using
+ * If reltuples >= 0, relpages, relallvisible are also updated (using
* RelationGetNumberOfBlocks() and visibilitymap_count()).
*
* NOTE: an important side-effect of this operation is that an SI invalidation
@@ -1921,7 +1921,7 @@ index_update_stats(Relation rel,
BlockNumber relallvisible;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ relallvisible = visibilitymap_count(rel, NULL);
else /* don't bother for indexes */
relallvisible = 0;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ccc030f..8c555eb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -444,6 +444,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(C.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ddb68ab..02a2c68 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -314,6 +314,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ BlockNumber relallvisible,
+ relallfrozen;
if (inh)
ereport(elevel,
@@ -564,6 +566,10 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
}
}
+ /* Calculate the number of all-visible and all-frozen bit */
+ if (!inh)
+ relallvisible = visibilitymap_count(onerel, &relallfrozen);
+
/*
* Update pages/tuples stats in pg_class ... but not if we're doing
* inherited stats.
@@ -572,7 +578,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ relallvisible,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
@@ -608,7 +614,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
* tracks per-table stats.
*/
if (!inh)
- pgstat_report_analyze(onerel, totalrows, totaldeadrows);
+ pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
if (!(options & VACOPT_VACUUM))
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7c4ef58..0a02a25 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -729,11 +729,11 @@ vac_estimate_reltuples(Relation relation, bool is_analyze,
* marked with xmin = our xid.
*
* In addition to fundamentally nontransactional statistics such as
- * relpages and relallvisible, we try to maintain certain lazily-updated
- * DDL flags such as relhasindex, by clearing them if no longer correct.
- * It's safe to do this in VACUUM, which can't run in parallel with
- * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
- * However, it's *not* safe to do it in an ANALYZE that's within an
+ * relpages and relallvisible, we try to maintain certain lazily-updated
+ * DDL flags such as relhasindex, by clearing them if no longer correct.
+ * It's safe to do this in VACUUM, which can't run in
+ * parallel with CREATE INDEX/RULE/TRIGGER and can't be part of a transaction
+ * block. However, it's *not* safe to do it in an ANALYZE that's within an
* outer transaction, because for example the current transaction might
* have dropped the last index; then we'd think relhasindex should be
* cleared, but if the transaction later rolls back this would be wrong.
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2429889..994efb7 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped by all-frozen bit
+ of visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -221,7 +224,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* We request a full scan if either the table's frozen Xid is now older
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit.
+ * mxid full-table scan limit. During a full scan, we can skip some pages
+ * according to the all-frozen bit of the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +257,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -301,10 +306,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = visibilitymap_count(onerel, &new_rel_allfrozen);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -325,7 +333,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),
onerel->rd_rel->relisshared,
new_live_tuples,
- vacrelstats->new_dead_tuples);
+ vacrelstats->new_dead_tuples,
+ new_rel_allfrozen);
/* and log the action if appropriate */
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
@@ -360,10 +369,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +496,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * the visibility map means that we can't update relfrozenxid, so we only want
+ * to do it if we can skip a goodly number. On the other hand, we count
+ * both how many pages we skipped according to the all-frozen bit of the
+ * visibility map and how many pages we froze, so we can update relfrozenxid
+ * if the sum of the two covers all pages of the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -498,24 +511,24 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we might not actually skip any pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
- * Note: The value returned by visibilitymap_test could be slightly
+ * Note: The value returned by visibilitymap_get_status could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*/
for (next_not_all_visible_block = 0;
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -533,9 +546,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples frozen by this vacuum */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* # of frozen tuples on a single page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -547,8 +564,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -562,14 +578,29 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
skipping_all_visible_blocks = true;
else
skipping_all_visible_blocks = false;
+
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We check whether this block is all-frozen, so that we can skip
+ * vacuuming this page even when scan_all is true.
+ */
+ bool all_frozen = VM_ALL_FROZEN(onerel, blkno, &vmbuffer);
+
+ if (scan_all && all_frozen)
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ else if (!scan_all && skipping_all_visible_blocks)
+ continue;
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -716,7 +747,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
empty_pages++;
freespace = PageGetHeapFreeSpace(page);
- /* empty pages are always all-visible */
+ /* empty pages are always all-visible and all-frozen */
if (!PageIsAllVisible(page))
{
START_CRIT_SECTION();
@@ -739,8 +770,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
log_newpage_buffer(buf, true);
PageSetAllVisible(page);
+ PageSetAllFrozen(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
END_CRIT_SECTION();
}
@@ -764,6 +797,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +953,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -931,9 +971,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
} /* scan along page */
/*
- * If we froze any tuples, mark the buffer dirty, and write a WAL
- * record recording the changes. We must log the changes to be
- * crash-safe against future truncation of CLOG.
+ * If we freeze any tuples, mark the buffer dirty, and write a WAL
+ * record recording the changes. We must log the changes to be crash-safe
+ * against future truncation of CLOG.
*/
if (nfrozen > 0)
{
@@ -966,6 +1006,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1031,46 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,9 +1081,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
{
- elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+ /* If the all-frozen bit is set then the all-visible bit must be set */
+ if (all_frozen_according_to_vm)
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page is not marked all-visible (and all-frozen) but visibility map bit(s) is set in relation \"%s\" page %u",
relname, blkno);
visibilitymap_clear(onerel, blkno, vmbuffer);
}
@@ -1028,19 +1096,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/*
* It's possible for the value returned by GetOldestXmin() to move
* backwards, so it's not wrong for us to see tuples that appear to
- * not be visible to everyone yet, while PD_ALL_VISIBLE is already
- * set. The real safe xmin value never moves backwards, but
- * GetOldestXmin() is conservative and sometimes returns a value
+ * not be visible to everyone yet, while PD_ALL_VISIBLE (and PD_ALL_FROZEN)
+ * are already set. The real safe xmin value never moves backwards,
+ * but GetOldestXmin() is conservative and sometimes returns a value
* that's unnecessarily small, so if we see that contradiction it just
* means that the tuples that we think are not visible to everyone yet
- * actually are, and the PD_ALL_VISIBLE flag is correct.
+ * actually are, and the PD_ALL_VISIBLE (and PD_ALL_FROZEN) flags are
+ * correct.
*
* There should never be dead tuples on a page with PD_ALL_VISIBLE
* set, however.
*/
else if (PageIsAllVisible(page) && has_dead_tuples)
{
- elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+ /* If the all-frozen bit is set then the all-visible bit must be set */
+ if (PageIsAllFrozen(page))
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page containing dead tuples is marked as all-visible (and all-frozen) in relation \"%s\" page %u",
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
@@ -1114,6 +1188,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to visibility map",
+ "skipped %d frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages,
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1230,6 +1311,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1281,19 +1363,34 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set the VM all-frozen bit and flag.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
+ uint8 vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer);
+ uint8 flags = 0;
+
+ if (!(vm_status & VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* Set the VM all-frozen bit to flag, if needed */
+ if (all_frozen && !(vm_status & VISIBILITYMAP_ALL_FROZEN))
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+
+ if (vm_status != flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1783,10 +1880,12 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and all_frozen which implies that all tuples
+ * of this page are frozen.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1795,6 +1894,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1818,11 +1918,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1859,6 +1960,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1867,6 +1972,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
@@ -1875,5 +1981,8 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
} /* scan along page */
+ if (!all_visible)
+ *all_frozen = false;
+
return all_visible;
}
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..e345177 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -85,7 +85,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
* which all tuples are known visible to everybody. In any case,
* we'll use the index tuple not the heap tuple as the data source.
*
- * Note on Memory Ordering Effects: visibilitymap_test does not lock
+ * Note on Memory Ordering Effects: visibilitymap_get_stattus does not lock
* the visibility map buffer, and therefore the result we read here
* could be slightly stale. However, it can't be stale enough to
* matter.
@@ -114,9 +114,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
* It's worth going through this complexity to avoid needing to lock
* the VM buffer, which could cause significant contention.
*/
- if (!visibilitymap_test(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ if (!VM_ALL_VISIBLE(scandesc->heapRelation, ItemPointerGetBlockNumber(tid),
+ &node->ioss_VMBuffer))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 990486c..d100a7d 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -468,7 +468,7 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
* estimates based on the correlation squared (XXX is that appropriate?).
*
* If it's an index-only scan, then we will not need to fetch any heap
- * pages for which the visibility map shows all tuples are visible.
+ * pages for which the visibility map map shows all tuples are visible.
* Hence, reduce the estimated number of heap fetches accordingly.
* We use the measured fraction of the entire heap that is all-visible,
* which might not be particularly relevant to the subset of the heap
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ab018c4..ca7257a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1329,7 +1329,8 @@ pgstat_report_autovac(Oid dboid)
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgVacuum msg;
@@ -1343,6 +1344,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -1354,7 +1356,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
*/
void
pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgAnalyze msg;
@@ -1394,6 +1397,7 @@ pgstat_report_analyze(Relation rel,
msg.m_analyzetime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -3702,6 +3706,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
+ result->n_frozen_pages = 0;
result->blocks_fetched = 0;
result->blocks_hit = 0;
result->vacuum_timestamp = 0;
@@ -5069,6 +5074,7 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
if (msg->m_autovacuum)
{
@@ -5103,6 +5109,7 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
/*
* We reset changes_since_analyze to zero, forgetting any changes that
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f7c9bf6..98c14f9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -46,6 +46,7 @@ extern Datum pg_stat_get_vacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autovacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_analyze_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_frozen_pages(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_calls(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_total_time(PG_FUNCTION_ARGS);
@@ -450,6 +451,21 @@ pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_frozen_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int32 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int32) (tabentry->n_frozen_pages);
+
+ PG_RETURN_INT32(result);
+}
+
+Datum
pg_stat_get_function_calls(PG_FUNCTION_ARGS)
{
Oid funcid = PG_GETARG_OID(0);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index c84783c..90e841e 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -10,6 +10,9 @@
#include "postgres_fe.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
#include <fcntl.h>
@@ -21,6 +24,43 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static const char *rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool force);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
+
+/*
+ * copyOrRewriteFile()
+ * This function copies a file, or rewrites a visibility map file.
+ * If rewrite_vm is true, we rewrite the visibility map regardless of the value of pageConverter.
+ */
+const char *
+copyOrRewriteFile(pageCnvCtx *pageConverter,
+ const char *src, const char *dst, bool force, bool rewrite_vm)
+{
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, force);
+ else
+ return copyAndUpdateFile(pageConverter, src, dst, force);
+}
/*
* copyAndUpdateFile()
@@ -115,12 +155,14 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
*/
const char *
linkAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst)
+ const char *src, const char *dst, bool rewrite_vm)
{
if (pageConverter != NULL)
return "Cannot in-place update this cluster, page-by-page conversion is required";
- if (pg_link_file(src, dst) == -1)
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, true);
+ else if (pg_link_file(src, dst) == -1)
return getErrorText();
else
return NULL;
@@ -205,6 +247,98 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilitymap()
+ *
+ * Copies a visibility map file, expanding each all-visible bit into a pair
+ * of bits with the new all-frozen bit initialized to 0.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+#define BITS_PER_HEAPBLOCK 2
+
+ int src_fd = 0;
+ int dst_fd = 0;
+ char buffer[BLCKSZ];
+ ssize_t bytesRead;
+ int rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+ BlockNumber blkno = 0;
+
+ /* Reset errno */
+ errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return getErrorText();
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ /* Perform data rewriting per page */
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
+ char *cur, *end, *blkend;
+ PageHeaderData pageheader;
+ uint16 vm_bits;
+
+ /* Save the page header data */
+ memcpy(&pageheader, buffer, SizeOfPageHeaderData);
+
+ cur = buffer + SizeOfPageHeaderData;
+ end = buffer + SizeOfPageHeaderData + rewriteVmBytesPerPage;
+ blkend = buffer + bytesRead;
+
+ while (blkend >= end)
+ {
+ char vmbuf[BLCKSZ];
+ char *vmtmp = vmbuf;
+
+ /* Copy page header in advance */
+ memcpy(vmbuf, &pageheader, SizeOfPageHeaderData);
+
+ vmtmp += SizeOfPageHeaderData;
+
+ /* Rewrite visibility map bit one by one */
+ while (end > cur)
+ {
+ /* Write the rewritten bits looked up from the table */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+ memcpy(vmtmp, &vm_bits, BITS_PER_HEAPBLOCK);
+
+ cur++;
+ vmtmp += BITS_PER_HEAPBLOCK;
+ }
+
+ /* Set new checksum for a visibility map page, if enabled */
+ if (old_cluster.controldata.data_checksum_version != 0 &&
+ new_cluster.controldata.data_checksum_version != 0)
+ ((PageHeader) vmbuf)->pd_checksum = pg_checksum_page(vmbuf, blkno);
+
+ if (write(dst_fd, vmbuf, BLCKSZ) != BLCKSZ)
+ {
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ end += rewriteVmBytesPerPage;
+ blkno++;
+ }
+ }
+
+err:
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText();
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index a43dff5..9b20064 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -110,6 +110,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map was changed by this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201511241
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -394,10 +398,12 @@ const pageCnvCtx *setupPageConverter(void);
typedef void *pageCnvCtx;
#endif
+const char *copyOrRewriteFile(pageCnvCtx *pageConverter, const char *src,
+ const char *dst, bool force, bool rewrite_vm);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst);
+ const char *dst, bool rewrite_vm);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index bfde1b1..5992bda 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *suffix, bool vm_need_rewrite);
/*
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_need_rewrite = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,13 +181,20 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_need_rewrite = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(pageConverter, &maps[mapnum], "", false);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +202,14 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(pageConverter, &maps[mapnum], "_fsm", false);
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ {
+ if (vm_need_rewrite)
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", true);
+ else
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", false);
+ }
}
}
}
@@ -210,7 +223,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
*/
static void
transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+ const char *type_suffix, bool vm_need_rewrite)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -276,7 +289,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ if ((msg = copyOrRewriteFile(pageConverter, old_file, new_file, true, vm_need_rewrite)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -284,7 +297,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
+ if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file, vm_need_rewrite)) != NULL)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index aa7f399..6b058d4 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..c55d232 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer pim_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index b3b91e7..a200e5e 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -40,6 +40,6 @@ extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other);
+ Buffer *pimbuffer, Buffer *pimbuffer_other);
#endif /* HIO_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..5f032ab 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,28 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+
+#define VISIBILITYMAP_VALID_BITS 0x03 /* OR of all valid visibilitymap flag bits */
+
+/* Macros for visibilitymap test */
+#define VM_ALL_VISIBLE(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_VISIBLE) != 0)
+#define VM_ALL_FROZEN(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern BlockNumber visibilitymap_count(Relation rel, BlockNumber *all_frozen);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index eba4150..5dd96f6 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201511071
+#define CATALOG_VERSION_NO 201511241
#endif
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index d8640db..9a77d7d 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2779,6 +2779,8 @@ DATA(insert OID = 3056 ( pg_stat_get_analyze_count PGNSP PGUID 12 1 0 0 0 f f f
DESCR("statistics: number of manual analyzes for a table");
DATA(insert OID = 3057 ( pg_stat_get_autoanalyze_count PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_autoanalyze_count _null_ _null_ _null_ ));
DESCR("statistics: number of auto analyzes for a table");
+DATA(insert OID = 6015 ( pg_stat_get_frozen_pages PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_frozen_pages _null_ _null_ _null_ ));
+DESCR("statistics: number of frozen pages of table");
DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 0 f f f f t t s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_idset _null_ _null_ _null_ ));
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f f f t s r 1 0 2249 "23" "{23,26,23,26,25,25,25,16,1184,1184,1184,1184,869,25,23,28,28,16,25,25,23,16,25}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,pid,usesysid,application_name,state,query,waiting,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,ssl,sslversion,sslcipher,sslbits,sslcompression,sslclientdn}" _null_ _null_ pg_stat_get_activity _null_ _null_ _null_ ));
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index eb3591a..6165500 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1363,7 +1363,7 @@ typedef struct IndexScanState
* RuntimeContext expr context for evaling runtime Skeys
* RelationDesc index relation descriptor
* ScanDesc index scan descriptor
- * VMBuffer buffer in use for visibility map testing, if any
+ * PIMBuffer buffer in use for visibility map testing, if any
* HeapFetches number of tuples we were forced to fetch from heap
* ----------------
*/
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9ecc163..614ca5a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -355,6 +355,7 @@ typedef struct PgStat_MsgVacuum
TimestampTz m_vacuumtime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgVacuum;
@@ -372,6 +373,7 @@ typedef struct PgStat_MsgAnalyze
TimestampTz m_analyzetime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgAnalyze;
@@ -551,7 +553,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9E
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -615,6 +617,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter n_dead_tuples;
PgStat_Counter changes_since_analyze;
+ int32 n_frozen_pages;
+
PgStat_Counter blocks_fetched;
PgStat_Counter blocks_hit;
@@ -917,9 +921,11 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type t
extern void pgstat_report_autovac(Oid dboid);
extern void pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_recovery_conflict(int reason);
extern void pgstat_report_deadlock(void);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..102aa81 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -367,7 +369,12 @@ typedef PageHeaderData *PageHeader;
#define PageSetAllVisible(page) \
(((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
#define PageClearAllVisible(page) \
- (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+ (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
+
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
#define PageIsPrunable(page, oldestxmin) \
( \
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 80374e4..b259e65 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1683,6 +1683,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(c.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid) AS last_autovacuum,
pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
@@ -1795,6 +1796,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
@@ -1838,6 +1840,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index f5be70f..dd49786 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -127,6 +127,8 @@ SELECT count(*) FROM tenk2 WHERE unique1 = 1;
1
(1 row)
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
SELECT pg_sleep(1.0);
@@ -175,6 +177,14 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
t | t
(1 row)
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+ ?column?
+----------
+ t
+(1 row)
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
snapshot_newer
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..3be0354
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,22 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 45 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+\set VERBOSITY default
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index b1bc7c7..e31fa76 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -110,3 +110,6 @@ test: event_trigger
# run stats by itself because its delay may be insufficient under heavy load
test: stats
+
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index ade9ef1..666e40c 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -161,3 +161,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index cd2d592..b3c640f 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -120,6 +120,8 @@ ROLLBACK;
SELECT count(*) FROM tenk2;
-- do an indexscan
SELECT count(*) FROM tenk2 WHERE unique1 = 1;
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
@@ -145,6 +147,10 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
WHERE st.relname='tenk2' AND cl.relname='tenk2';
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..365570b
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,16 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+\set VERBOSITY default
+
+DROP TABLE vmtest;
Hello,
You're right, it's not necessary.
Attached latest v29 patch which removes the mention in pg_upgrade documentation.
The changes look correct, but I haven't tested them.
And I have some additional random comments.
visibilitymap.c:
In visibilitymap_set, the following lines:
map = PageGetContents(page);
...
if (flags != (map[mapByte] & (flags << mapBit)))
map is (char*), PageGetContents returns (char*) but flags is
uint8. I think that defining map as (uint8*) would be safer.
In visibilitymap_set, the following lines do something
different from what is intended. Only the right side of the inequality
gets shifted, and what should be used on the right side is not flags
but VISIBILITYMAP_VALID_BITS.
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] & (flags << mapBit)))
Something like the following will do the right thing.
+ if (flags != (map[mapByte]>>mapBit & VISIBILITYMAP_VALID_BITS))
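To make the point concrete, here is a minimal standalone sketch (not part of
the patch; the byte value, mapBit and the main() wrapper are made up purely
for illustration, while the constants mirror the ones the patch adds to
visibilitymap.h):

#include <stdio.h>
#include <stdint.h>

#define VISIBILITYMAP_ALL_VISIBLE 0x01
#define VISIBILITYMAP_ALL_FROZEN  0x02
#define VISIBILITYMAP_VALID_BITS  0x03

int
main(void)
{
    uint8_t mapbyte = 0x04;  /* block at mapBit 2: all-visible set, all-frozen clear */
    int     mapBit  = 2;
    uint8_t flags   = VISIBILITYMAP_ALL_VISIBLE;  /* caller asks for all-visible only */

    /* Test as written in the patch: only the right-hand side is shifted. */
    int patched = (flags != (mapbyte & (flags << mapBit)));

    /* Suggested test: shift the byte down and mask with all valid bits. */
    int suggested = (flags != ((mapbyte >> mapBit) & VISIBILITYMAP_VALID_BITS));

    /*
     * Prints "patched=1 suggested=0": the patched form would redundantly
     * re-set (and WAL-log) a bit that is already set, while the suggested
     * form sees that the block's pair of bits already equals flags.
     */
    printf("patched=%d suggested=%d\n", patched, suggested);
    return 0;
}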
analyze.c:
In do_analyze_rel, the successive if (!inh) tests in the following
steps look a bit odd. This is emphasized by the first if
block you added :) These blocks should be enclosed in a single if (!inh)
{} block (a stubbed sketch of the consolidated shape follows the quoted code).
/* Calculate the number of all-visible and all-frozen bit */
if (!inh)
relallvisible = visibilitymap_count(onerel, &relallfrozen);
if (!inh)
vac_update_relstats(onerel,
if (!inh && !(options & VACOPT_VACUUM))
{
for (ind = 0; ind < nindexes; ind++)
...
}
if (!inh)
pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
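For illustration, a compilable, stubbed sketch of that consolidated shape;
the function names only mirror the ones used in do_analyze_rel(), and the
bodies and values are fake stand-ins:

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int BlockNumber;

/* Fake stand-ins for the server routines named in the quoted fragment */
static BlockNumber
visibilitymap_count_stub(BlockNumber *all_frozen)
{
    *all_frozen = 7;            /* pretend 7 pages are all-frozen */
    return 9;                   /* pretend 9 pages are all-visible */
}

static void
vac_update_relstats_stub(BlockNumber relallvisible, BlockNumber relallfrozen)
{
    printf("relallvisible=%u relallfrozen=%u\n", relallvisible, relallfrozen);
}

static void
pgstat_report_analyze_stub(BlockNumber relallfrozen)
{
    printf("reported frozen pages=%u\n", relallfrozen);
}

int
main(void)
{
    bool    inh = false;
    bool    doing_vacuum = false;
    int     nindexes = 2;

    if (!inh)
    {
        BlockNumber relallfrozen;
        BlockNumber relallvisible = visibilitymap_count_stub(&relallfrozen);

        vac_update_relstats_stub(relallvisible, relallfrozen);

        if (!doing_vacuum)
        {
            int     ind;

            for (ind = 0; ind < nindexes; ind++)
                printf("analyzing index %d\n", ind);
        }

        pgstat_report_analyze_stub(relallfrozen);
    }
    return 0;
}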
vacuum.c:
- * relpages and relallvisible, we try to maintain certain lazily-updated
- * DDL flags such as relhasindex, by clearing them if no longer correct.
- * It's safe to do this in VACUUM, which can't run in parallel with
- * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
- * However, it's *not* safe to do it in an ANALYZE that's within an
+ * relpages, relallvisible, we try to maintain certain lazily-updated
Why did you just drop the 'and' after relpages? And this seems
the only change of this file except the additionally missing
letter just below :p
+ * DDL flags such as relhasindex, by clearing them if no onger correct.
+ * It's safe to do this in VACUUM, which can't run in
+ * parallel with CREATE INDEX/RULE/TRIGGER and can't be part of a transaction
+ * block. However, it's *not* safe to do it in an ANALYZE that's within an
nodeIndexonlyscan.c:
Duplicated letters. Also, the line exceeds the right margin.
- * Note on Memory Ordering Effects: visibilitymap_test does not lock
-> + * Note on Memory Ordering Effects: visibilitymap_get_stattus does not lock
+ * Note on Memory Ordering Effects: visibilitymap_get_status does not lock
The edited line exceeds the right margin because a newline was removed.
- if (!visibilitymap_test(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ if (!VM_ALL_VISIBLE(scandesc->heapRelation, ItemPointerGetBlockNumber(tid),
+ &node->ioss_VMBuffer))
costsize.c:
Duplicated words, and it is the only change in this file.
- * pages for which the visibility map shows all tuples are visible.
-> + * pages for which the visibility map map shows all tuples are visible.
+ * pages for which the visibility map shows all tuples are visible.
pgstat.c:
The new parameter frozenpages of pgstat_report_vacuum() is
defined as int32, but its callers pass BlockNumber (= uint32). I
recommend defining frozenpages as BlockNumber.
PgStat_MsgVacuum has a corresponding member defined as int32.
pg_upgrade.c:
BITS_PER_HEAPBLOCK is defined in two c files with the same
definition. This would be better merged into some header
file.
heapam_xlog.h, hio.h, execnodes.h:
Have we decided to rename vm to pim? Anyway, it is inconsistent
with the corresponding definition in the function body, which
remains 'vm_buffer'. (I'm not confident on that, though.)
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer pim_buffer, TransactionId cutoff_xid, uint8 flags);
regards,
At Wed, 2 Dec 2015 00:10:09 +0530, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoC72S2ShoeAmCxWYUyGSNOaTn4fMHJ-ZKNX-MPcsQpaOw@mail.gmail.com>
On Tue, Dec 1, 2015 at 3:04 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Nov 30, 2015 at 9:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
After running pg_upgrade, if I manually vacuum a table I start getting warnings:
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32756
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32756
...
The warnings are right where the blocks would start using the 2nd page
of the _vm, so I think the problem is there. And looking at the code,
I think that "cur += SizeOfPageHeaderData;" in the inner loop cannot
be correct. We can't skip a header in the current (old) block each
time we reach the end of the new block. The thing we are skipping in
the current block is half the time not a header, but the data at the
halfway point through the block.
Thank you for reviewing.
You're right, it's not necessary.
Attached latest v29 patch which removes the mention in pg_upgrade documentation.
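(As a side note for readers following the quoted _vm discussion: the
standalone sketch below is not part of any patch and hard-codes BLCKSZ and
the page header size purely for illustration. It reproduces the two pieces
of arithmetic involved: the bit-spreading table used by
rewriteVisibilitymap() and the old-page-to-new-page split that the quoted
report is about.)

#include <stdio.h>
#include <stdint.h>

#define BLCKSZ 8192                     /* illustrative; the real value comes from pg_config.h */
#define SIZE_OF_PAGE_HEADER_DATA 24     /* illustrative stand-in for SizeOfPageHeaderData */

int
main(void)
{
    unsigned    b;
    int         payload = BLCKSZ - SIZE_OF_PAGE_HEADER_DATA;

    /*
     * Each old byte holds 8 all-visible bits; in the new format every heap
     * block gets two bits, so bit i of the old byte moves to bit 2*i of a
     * 16-bit value, with the new all-frozen bit left at 0.  This reproduces
     * the entries of rewrite_vm_table[].
     */
    for (b = 0; b < 256; b++)
    {
        uint16_t    out = 0;
        int         i;

        for (i = 0; i < 8; i++)
            if (b & (1u << i))
                out |= (uint16_t) (1u << (2 * i));

        if (b <= 2 || b == 255)
            printf("rewrite_vm_table[%u] = %u\n", b, out);
    }

    /*
     * Because every payload byte doubles in size, half of an old page's
     * payload fills one new page's payload: one old VM page becomes two new
     * VM pages, which is why the rewrite loop must not skip a page header at
     * the halfway point of the old block (the issue described in the quoted
     * report).
     */
    printf("old payload bytes consumed per rewritten page: %d\n", payload / 2);
    return 0;
}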
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Dec 2, 2015 at 9:30 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello,
You're right, it's not necessary.
Attached latest v29 patch which removes the mention in pg_upgrade documentation.
The changes look correct, but I haven't tested them.
And I have some additional random comments.
Thank you for reviewing!
Fixed these following points, and attached latest patch.
visibilitymap.c:
In visibilitymap_set, the following lines:
map = PageGetContents(page);
...
if (flags != (map[mapByte] & (flags << mapBit)))
map is (char*), PageGetContents returns (char*) but flags is
uint8. I think that defining map as (uint8*) would be safer.
I agree with you.
Fixed.
In visibilitymap_set, the following lines do something
different from what is intended. Only the right side of the inequality
gets shifted, and what should be used on the right side is not flags
but VISIBILITYMAP_VALID_BITS.
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] & (flags << mapBit)))
Something like the following will do the right thing.
+ if (flags != (map[mapByte]>>mapBit & VISIBILITYMAP_VALID_BITS))
You're right.
Fixed.
analyze.c:
In do_analyze_rel, the successive if (!inh) tests in the following
steps look a bit odd. This is emphasized by the first if
block you added :) These blocks should be enclosed in a single if (!inh)
{} block.
/* Calculate the number of all-visible and all-frozen bit */
if (!inh)
relallvisible = visibilitymap_count(onerel, &relallfrozen);
if (!inh)
vac_update_relstats(onerel,
if (!inh && !(options & VACOPT_VACUUM))
{
for (ind = 0; ind < nindexes; ind++)...
}
if (!inh)
pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
Fixed.
vacuum.c:
- * relpages and relallvisible, we try to maintain certain lazily-updated
- * DDL flags such as relhasindex, by clearing them if no longer correct.
- * It's safe to do this in VACUUM, which can't run in parallel with
- * CREATE INDEX/RULE/TRIGGER and can't be part of a transaction block.
- * However, it's *not* safe to do it in an ANALYZE that's within an
+ * relpages, relallvisible, we try to maintain certain lazily-updated
Why did you just drop the 'and' after relpages? And this seems
the only change of this file except the additionally missing
letter just below :p
+ * DDL flags such as relhasindex, by clearing them if no onger correct.
+ * It's safe to do this in VACUUM, which can't run in
+ * parallel with CREATE INDEX/RULE/TRIGGER and can't be part of a transaction
+ * block. However, it's *not* safe to do it in an ANALYZE that's within an
Fixed.
nodeIndexonlyscan.c:
Duplicated letters. Also, the line exceeds the right margin.
- * Note on Memory Ordering Effects: visibilitymap_test does not lock
-> + * Note on Memory Ordering Effects: visibilitymap_get_stattus does not lock
+ * Note on Memory Ordering Effects: visibilitymap_get_status does not lock
Fixed.
The edited line exceeds the right margin because a newline was removed.
- if (!visibilitymap_test(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ if (!VM_ALL_VISIBLE(scandesc->heapRelation, ItemPointerGetBlockNumber(tid),
+ &node->ioss_VMBuffer))
Fixed.
costsize.c:
Duplicated words, and it is the only change in this file.
- * pages for which the visibility map shows all tuples are visible.
-> + * pages for which the visibility map map shows all tuples are visible.
+ * pages for which the visibility map shows all tuples are visible.
Fixed.
pgstat.c:
The new parameter frozenpages of pgstat_report_vacuum() is
defined as int32, but its callers pass BlockNumber (= uint32). I
recommend defining frozenpages as BlockNumber.
PgStat_MsgVacuum has a corresponding member defined as int32.
I agree with you.
Fixed.
pg_upgrade.c:
BITS_PER_HEAPBLOCK is defined in two c files with the same
definition. This would be better merged into some header
file.
Fixed.
I moved this definition to visibilitymap.h.
heapam_xlog.h, hio.h, execnodes.h:
Have we decided to rename vm to pim? Anyway, it is inconsistent
with the corresponding definition in the function body, which
remains 'vm_buffer'. (I'm not confident on that, though.)
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer pim_buffer, TransactionId cutoff_xid, uint8 flags);
Fixed.
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v30.patch (application/octet-stream)
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..e8ebfe9 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 392eb70..c43443a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5916,7 +5916,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -5960,7 +5960,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..5a43c28 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and of which pages contain only frozen tuples.
+ This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -438,28 +438,25 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows. To ensure all old
+ row versions have been frozen, a scan of all unfrozen pages is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: whole-table freezing is forced if
+ the table hasn't had all of its row versions frozen within
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transactions.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
- that, data loss could result. To ensure that this does not happen,
- autovacuum is invoked on any table that might contain unfrozen rows with
- XIDs older than the age specified by the configuration parameter <xref
- linkend="guc-autovacuum-freeze-max-age">. (This will happen even if
- autovacuum is disabled.)
+ the time <command>VACUUM</> last scanned unfrozen pages. If it were to go
+ unvacuumed for longer than that, data loss could result. To ensure
+ that this does not happen, autovacuum is invoked on any table that might
+ contain unfrozen rows with XIDs older than the age specified by the
+ configuration parameter <xref linkend="guc-autovacuum-freeze-max-age">.
+ (This will happen even if autovacuum is disabled.)
</para>
<para>
@@ -490,8 +487,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +523,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +551,19 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. Whole-table freezing occurs only when all pages happen to
+ require freezing in order to freeze rows. In other cases, such as when
+ <structfield>relfrozenxid</> is more than <varname>vacuum_freeze_table_age</>
+ transactions old or when <command>VACUUM</>'s <literal>FREEZE</> option is used,
+ <command>VACUUM</> can skip pages on which all tuples are already
+ marked as frozen.
+ When all pages of the table are eventually marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transactions started since the <command>VACUUM</> started).
+ If <structfield>relfrozenxid</> has not been advanced by the time
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,28 +640,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ Scanning all unfrozen pages with <command>VACUUM</>, regardless of what
+ causes it, enables advancing the value for that table.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Such
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +741,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
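
To put rough numbers on the thresholds touched above (using the 150 million
default documented here, together with the stock defaults of 50 million for
vacuum_freeze_min_age and 200 million for autovacuum_freeze_max_age, which this
patch does not change): whole-table freezing is forced once roughly
150 - 50 = 100 million transactions have passed since all row versions in the
table were last guaranteed frozen, and if relfrozenxid still has not advanced
by the time its age reaches 200 million, an anti-wraparound autovacuum is
forced even when autovacuum is disabled.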
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e64b7ef..1908a4d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1332,6 +1332,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Estimated number of rows modified since this table was last analyzed</entry>
</row>
<row>
+ <entry><structfield>n_frozen_page</></entry>
+ <entry><type>integer</></entry>
+ <entry>Number of pages marked all-frozen in the table's visibility map</entry>
+ </row>
+ <row>
<entry><structfield>last_vacuum</></entry>
<entry><type>timestamp with time zone</></entry>
<entry>Last time at which this table was manually vacuumed
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..5dc8b04 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -623,18 +623,20 @@ can be used to examine the information stored in free space maps.
<para>
Each heap relation has a Visibility Map
(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
+visible to all active transactions, and which pages contain only frozen tuples.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_vm</> suffix.
+For example, if the filenode of a relation is 12345, the VM is stored in a file
+called <filename>12345_vm</>, in the same directory as the main relation file.
Note that indexes do not have VMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The visibility map stores two bits per heap page: all-visible and all-frozen.
+A set all-visible bit means that all tuples on the page are known to be visible
+to all transactions. A set all-frozen bit means that all tuples on the page are
+completely frozen. This means that the page does not contain any tuples
+that need to be vacuumed, even if a whole-table scan is required.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
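
To illustrate the two-bit layout this doc change describes, here is a minimal
standalone sketch (not part of the patch; the packing of four heap pages per
map byte and the flag values 0x01 for all-visible / 0x02 for all-frozen are
assumptions taken from the rest of this patch) that decodes one visibility map
byte into per-page status:

#include <stdio.h>
#include <stdint.h>

#define BITS_PER_HEAPBLOCK  2
#define HEAPBLOCKS_PER_BYTE 4          /* 8 bits / 2 bits per heap page */
#define ALL_VISIBLE         0x01       /* assumed flag values */
#define ALL_FROZEN          0x02
#define VALID_BITS          (ALL_VISIBLE | ALL_FROZEN)

int
main(void)
{
	uint8_t		map_byte = 0x1D;	/* pages 0,1,2 all-visible, page 1 also all-frozen */

	for (int page = 0; page < HEAPBLOCKS_PER_BYTE; page++)
	{
		int			shift = page * BITS_PER_HEAPBLOCK;
		uint8_t		status = (map_byte >> shift) & VALID_BITS;

		printf("heap page %d: all-visible=%d all-frozen=%d\n",
			   page, (status & ALL_VISIBLE) != 0, (status & ALL_FROZEN) != 0);
	}
	return 0;
}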
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9ff7a41..651dd0e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3034,9 +3034,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -7205,7 +7205,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -7215,6 +7215,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7804,7 +7805,12 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7856,7 +7862,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6db73bf..0fe49eb 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -119,7 +119,7 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock,
* be less than buffer2.
*/
static void
-GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
+GetVisibilitymapPins(Relation relation, Buffer buffer1, Buffer buffer2,
BlockNumber block1, BlockNumber block2,
Buffer *vmbuffer1, Buffer *vmbuffer2)
{
@@ -380,11 +380,11 @@ RelationGetBufferForTuple(Relation relation, Size len,
* done.
*/
if (otherBuffer == InvalidBuffer || buffer <= otherBuffer)
- GetVisibilityMapPins(relation, buffer, otherBuffer,
+ GetVisibilitymapPins(relation, buffer, otherBuffer,
targetBlock, otherBlock, vmbuffer,
vmbuffer_other);
else
- GetVisibilityMapPins(relation, otherBuffer, buffer,
+ GetVisibilitymapPins(relation, otherBuffer, buffer,
otherBlock, targetBlock, vmbuffer_other,
vmbuffer);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..5b4300d 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -15,39 +15,46 @@
* visibilitymap_pin - pin a map page for setting a bit
* visibilitymap_pin_ok - check whether correct map page is already pinned
* visibilitymap_set - set a bit in a previously pinned page
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
* visibilitymap_count - count number of bits set in visibility map
* visibilitymap_truncate - truncate the visibility map
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * when a whole-table-scanning vacuum (e.g. an anti-wraparound vacuum) is required.
+ * The all-frozen bit must be set only when the page is already all-visible.
+ *
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
*
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. The all-frozen bit must be
+ * cleared together with the all-visible bit.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is all-visible or
+ * all-frozen; we just don't know that for certain. The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility map
+ * bit. If a crash occurs after the visibility map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
+ * The visibility map also has an all-frozen bit, which indicates that all tuples
+ * on the corresponding page have been completely frozen, so the visibility map
+ * is used by anti-wraparound vacuums as well, even though they need to freeze tuples.
*
* LOCKING
*
@@ -58,14 +65,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,38 +108,50 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
-
-/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
+/* Mapping from heap block number to the right bit in the visibility map */
+#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
+#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
+#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
-/* Mapping from heap block number to the right bit in the visibility map */
-#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
-#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
-#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+/* tables for fast counting of set bits for visible and freeze */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +160,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits for one heap page in the visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,11 +172,11 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = VISIBILITYMAP_VALID_BITS << mapBit;
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s block %d", RelationGetRelationName(rel), heapBlk);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -186,7 +205,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
* visibilitymap_set to actually set the bit.
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk.
*
@@ -212,7 +231,7 @@ visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
* visibilitymap_pin_ok - do we already have the correct page pinned?
*
* On entry, buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. The return value indicates whether the buffer covers the
* given heapBlk.
*/
@@ -225,7 +244,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +253,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and the flags indicating which bit(s) to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,20 +265,22 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
Page page;
- char *map;
+ uint8 *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s block %d, flag %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
Assert(InRecovery || BufferIsValid(heapBuf));
+ Assert(flags & VISIBILITYMAP_VALID_BITS);
/* Check that we have the right heap page pinned, if present */
if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -269,14 +291,14 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
elog(ERROR, "wrong VM buffer passed to visibilitymap_set");
page = BufferGetPage(vmBuf);
- map = PageGetContents(page);
+ map = (uint8 *)PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] >> mapBit & VISIBILITYMAP_VALID_BITS))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << mapBit);
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +307,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +317,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(((flags & VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
+ ((flags & VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +337,17 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible to all, or all frozen, according
+ * to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
- * earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits. The returned status
+ * can be tested against whichever flag bits the caller cares about.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -327,17 +356,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* we might see the old value. It is the caller's responsibility to deal with
* all concurrency issues!
*/
-bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+uint8
+visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- bool result;
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_get_status %s, block %d", RelationGetRelationName(rel), heapBlk);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,13 +388,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A read of both bits is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
-
- return result;
+ return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
}
/*
@@ -374,13 +400,17 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * If all_frozen is non-NULL, the number of all-frozen pages is also returned
+ * through it.
*/
BlockNumber
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, BlockNumber *all_frozen)
{
- BlockNumber result = 0;
BlockNumber mapBlock;
+ BlockNumber all_visible = 0;
+
+ if (all_frozen)
+ *all_frozen = 0;
for (mapBlock = 0;; mapBlock++)
{
@@ -406,13 +436,15 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ all_visible += number_of_ones_for_visible[map[i]];
+ if (all_frozen)
+ *all_frozen += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
}
- return result;
+ return all_visible;
}
/*
@@ -435,7 +467,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ elog(DEBUG1, "vm_truncate %s block %d", RelationGetRelationName(rel), nheapblocks);
#endif
RelationOpenSmgr(rel);
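
Since each map byte now packs two bits per heap page, the two lookup tables
added above simply count the bits at even positions (all-visible) and at odd
positions (all-frozen) of a byte. A quick standalone cross-check of that
reading, not part of the patch:

#include <stdio.h>
#include <stdint.h>

/* Count the bits of 'byte' at positions first_bit, first_bit+2, ... */
static int
count_alternate_bits(uint8_t byte, int first_bit)
{
	int			n = 0;

	for (int bit = first_bit; bit < 8; bit += 2)
		if (byte & (1 << bit))
			n++;
	return n;
}

int
main(void)
{
	/* Expected contents of number_of_ones_for_visible[] and _for_frozen[] */
	for (int b = 0; b < 256; b++)
		printf("%3d: visible=%d frozen=%d\n",
			   b,
			   count_alternate_bits((uint8_t) b, 0),
			   count_alternate_bits((uint8_t) b, 1));
	return 0;
}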
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..b173e7b 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1921,7 +1921,7 @@ index_update_stats(Relation rel,
BlockNumber relallvisible;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ relallvisible = visibilitymap_count(rel, NULL);
else /* don't bother for indexes */
relallvisible = 0;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ccc030f..8c555eb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -444,6 +444,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(C.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ddb68ab..d53fa06 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -314,6 +314,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ BlockNumber relallvisible,
+ relallfrozen;
if (inh)
ereport(elevel,
@@ -564,51 +566,57 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
}
}
- /*
- * Update pages/tuples stats in pg_class ... but not if we're doing
- * inherited stats.
- */
if (!inh)
+ {
+ /* Count the all-visible and all-frozen bits in the visibility map */
+ relallvisible = visibilitymap_count(onerel, &relallfrozen);
+
+ /*
+ * Update pages/tuples stats in pg_class ... but not if we're doing
+ * inherited stats.
+ */
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ relallvisible,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
in_outer_xact);
- /*
- * Same for indexes. Vacuum always scans all indexes, so if we're part of
- * VACUUM ANALYZE, don't overwrite the accurate count already inserted by
- * VACUUM.
- */
- if (!inh && !(options & VACOPT_VACUUM))
- {
- for (ind = 0; ind < nindexes; ind++)
+ /*
+ * Same for indexes. Vacuum always scans all indexes, so if we're part of
+ * VACUUM ANALYZE, don't overwrite the accurate count already inserted by
+ * VACUUM.
+ */
+ if (!(options & VACOPT_VACUUM))
{
- AnlIndexData *thisdata = &indexdata[ind];
- double totalindexrows;
-
- totalindexrows = ceil(thisdata->tupleFract * totalrows);
- vac_update_relstats(Irel[ind],
- RelationGetNumberOfBlocks(Irel[ind]),
- totalindexrows,
- 0,
- false,
- InvalidTransactionId,
- InvalidMultiXactId,
- in_outer_xact);
+ for (ind = 0; ind < nindexes; ind++)
+ {
+ AnlIndexData *thisdata = &indexdata[ind];
+ double totalindexrows;
+
+ totalindexrows = ceil(thisdata->tupleFract * totalrows);
+ vac_update_relstats(Irel[ind],
+ RelationGetNumberOfBlocks(Irel[ind]),
+ totalindexrows,
+ 0,
+ false,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ in_outer_xact);
+ }
}
- }
- /*
- * Report ANALYZE to the stats collector, too. However, if doing
- * inherited stats we shouldn't report, because the stats collector only
- * tracks per-table stats.
- */
- if (!inh)
- pgstat_report_analyze(onerel, totalrows, totaldeadrows);
+ /*
+ * Report ANALYZE to the stats collector, too. However, if doing
+ * inherited stats we shouldn't report, because the stats collector only
+ * tracks per-table stats.
+ */
+ if (!inh)
+ pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
+
+ }
/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
if (!(options & VACOPT_VACUUM))
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2429889..ac64c11 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped by all-frozen bit
+ of visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,7 +158,7 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+ TransactionId *visibility_cutoff_xid, bool *all_frozen);
/*
@@ -188,7 +190,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -221,7 +224,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* We request a full scan if either the table's frozen Xid is now older
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit.
+ * mxid full-table scan limit. Even during a full scan, we can skip some pages
+ * according to the all-frozen bit of the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +257,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -301,10 +306,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ new_rel_allvisible = visibilitymap_count(onerel, &new_rel_allfrozen);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -325,7 +333,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),
onerel->rd_rel->relisshared,
new_live_tuples,
- vacrelstats->new_dead_tuples);
+ vacrelstats->new_dead_tuples,
+ new_rel_allfrozen);
/* and log the action if appropriate */
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
@@ -360,10 +369,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +496,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * the visibility map means that we can't update relfrozenxid, so we only
+ * want to do it if we can skip a goodly number of pages. On the other hand,
+ * we count both how many pages we skipped according to the all-frozen bit
+ * of the visibility map and how many pages we froze, so we can still update
+ * relfrozenxid if the sum of the two equals the number of pages in the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -498,24 +511,24 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we skip only all-frozen pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
- * Note: The value returned by visibilitymap_test could be slightly
+ * Note: The value returned by visibilitymap_get_status could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*/
for (next_not_all_visible_block = 0;
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -533,9 +546,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples frozen by this vacuum */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page; /* # of tuples on this page */
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -547,8 +564,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -562,14 +578,29 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
skipping_all_visible_blocks = true;
else
skipping_all_visible_blocks = false;
+
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We also check whether it is all-frozen, in which case we can skip
+ * vacuuming it even if scan_all is true.
+ */
+ bool all_frozen = VM_ALL_FROZEN(onerel, blkno, &vmbuffer);
+
+ if (scan_all && all_frozen)
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ else if (!scan_all && skipping_all_visible_blocks)
+ continue;
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -716,7 +747,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
empty_pages++;
freespace = PageGetHeapFreeSpace(page);
- /* empty pages are always all-visible */
+ /* empty pages are always all-visible and all-frozen */
if (!PageIsAllVisible(page))
{
START_CRIT_SECTION();
@@ -739,8 +770,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
log_newpage_buffer(buf, true);
PageSetAllVisible(page);
+ PageSetAllFrozen(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
END_CRIT_SECTION();
}
@@ -764,6 +797,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -918,8 +953,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -966,6 +1006,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1031,46 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,9 +1081,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
{
- elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+ /* If the all-frozen bit is set then the all-visible bit must be set too */
+ if (all_frozen_according_to_vm)
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page is not marked all-visible (and all-frozen) but visibility map bit(s) is set in relation \"%s\" page %u",
relname, blkno);
visibilitymap_clear(onerel, blkno, vmbuffer);
}
@@ -1028,19 +1096,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/*
* It's possible for the value returned by GetOldestXmin() to move
* backwards, so it's not wrong for us to see tuples that appear to
- * not be visible to everyone yet, while PD_ALL_VISIBLE is already
- * set. The real safe xmin value never moves backwards, but
- * GetOldestXmin() is conservative and sometimes returns a value
+ * not be visible to everyone yet, while PD_ALL_VISIBLE (and PD_ALL_FROZEN)
+ * are already set. The real safe xmin value never moves backwards,
+ * but GetOldestXmin() is conservative and sometimes returns a value
* that's unnecessarily small, so if we see that contradiction it just
* means that the tuples that we think are not visible to everyone yet
- * actually are, and the PD_ALL_VISIBLE flag is correct.
+ * actually are, and the PD_ALL_VISIBLE (and PD_ALL_FROZEN) flags are
+ * correct.
*
* There should never be dead tuples on a page with PD_ALL_VISIBLE
* set, however.
*/
else if (PageIsAllVisible(page) && has_dead_tuples)
{
- elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+ /* If the all-frozen bit is set then the all-visible bit must be set too */
+ if (PageIsAllFrozen(page))
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page containing dead tuples is marked as all-visible (and all-frozen) in relation \"%s\" page %u",
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
@@ -1114,6 +1188,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to visibility map",
+ "skipped %d frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages,
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1230,6 +1311,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1281,19 +1363,34 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid, &all_frozen))
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set the VM all-frozen bit and flag.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
+ uint8 vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer);
+ uint8 flags = 0;
+
+ if (!(vm_status & VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* Set the VM all-frozen bit to flag, if needed */
+ if (all_frozen && !(vm_status & VISIBILITYMAP_ALL_FROZEN))
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+
+ if (vm_status != flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1783,10 +1880,12 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and set *all_frozen to indicate whether all
+ * tuples of this page are frozen.
*/
static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
@@ -1795,6 +1894,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1818,11 +1918,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1859,6 +1960,10 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1867,6 +1972,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
all_visible = false;
+ *all_frozen = false;
break;
default:
@@ -1875,5 +1981,8 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
} /* scan along page */
+ if (!all_visible)
+ *all_frozen = false;
+
return all_visible;
}
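
To make the new per-block decision in lazy_scan_heap easier to follow, here is
a simplified standalone sketch of it, not the patch code itself (it leaves out
how skipping_all_visible_blocks gets computed): a block whose all-frozen bit is
set can be skipped even by a scan_all (anti-wraparound) vacuum, and such skips
are counted so relfrozenxid can still be advanced.

#include <stdbool.h>
#include <stdio.h>

typedef struct
{
	unsigned	vmskipped_frozen_pages;
} SketchStats;

/* Should this heap block be skipped, given its visibility map status? */
static bool
skip_block(bool scan_all, bool skipping_all_visible_blocks,
		   bool all_visible, bool all_frozen, SketchStats *stats)
{
	if (!all_visible)
		return false;			/* not-all-visible pages are always visited */

	if (scan_all && all_frozen)
	{
		stats->vmskipped_frozen_pages++;	/* skipped despite the full scan */
		return true;
	}
	if (!scan_all && skipping_all_visible_blocks)
		return true;			/* ordinary vacuum skips runs of all-visible pages */

	return false;
}

int
main(void)
{
	SketchStats stats = {0};

	printf("%d\n", skip_block(true, false, true, true, &stats));	/* 1: all-frozen, skipped */
	printf("%d\n", skip_block(true, false, true, false, &stats));	/* 0: still needs freezing */
	printf("skipped frozen pages: %u\n", stats.vmskipped_frozen_pages);
	return 0;
}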
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..e9cf4c8 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -85,9 +85,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
* which all tuples are known visible to everybody. In any case,
* we'll use the index tuple not the heap tuple as the data source.
*
- * Note on Memory Ordering Effects: visibilitymap_test does not lock
- * the visibility map buffer, and therefore the result we read here
- * could be slightly stale. However, it can't be stale enough to
+ * Note on Memory Ordering Effects: visibilitymap_get_status does not
+ * lock the visibility map buffer, and therefore the result we read
+ * here could be slightly stale. However, it can't be stale enough to
* matter.
*
* We need to detect clearing a VM bit due to an insert right away,
@@ -114,9 +114,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
* It's worth going through this complexity to avoid needing to lock
* the VM buffer, which could cause significant contention.
*/
- if (!visibilitymap_test(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ if (!VM_ALL_VISIBLE(scandesc->heapRelation,
+ ItemPointerGetBlockNumber(tid),
+ &node->ioss_VMBuffer))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ab018c4..ca7257a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1329,7 +1329,8 @@ pgstat_report_autovac(Oid dboid)
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgVacuum msg;
@@ -1343,6 +1344,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -1354,7 +1356,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
*/
void
pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgAnalyze msg;
@@ -1394,6 +1397,7 @@ pgstat_report_analyze(Relation rel,
msg.m_analyzetime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -3702,6 +3706,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
+ result->n_frozen_pages = 0;
result->blocks_fetched = 0;
result->blocks_hit = 0;
result->vacuum_timestamp = 0;
@@ -5069,6 +5074,7 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
if (msg->m_autovacuum)
{
@@ -5103,6 +5109,7 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
/*
* We reset changes_since_analyze to zero, forgetting any changes that
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f7c9bf6..98c14f9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -46,6 +46,7 @@ extern Datum pg_stat_get_vacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autovacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_analyze_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_frozen_pages(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_calls(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_total_time(PG_FUNCTION_ARGS);
@@ -450,6 +451,21 @@ pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_frozen_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int32 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int32) (tabentry->n_frozen_pages);
+
+ PG_RETURN_INT32(result);
+}
+
+Datum
pg_stat_get_function_calls(PG_FUNCTION_ARGS)
{
Oid funcid = PG_GETARG_OID(0);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index c84783c..312dca6 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -9,7 +9,11 @@
#include "postgres_fe.h"
+#include "access/visibilitymap.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
#include <fcntl.h>
@@ -21,6 +25,43 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static const char *rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool force);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
+
+/*
+ * copyOrRewriteFile()
+ * This function copies file or rewrite visibility map file.
+ * If rewrite_vm is true, we have to rewrite visibility map regardless value of pageConverter.
+ */
+const char *
+copyOrRewriteFile(pageCnvCtx *pageConverter,
+ const char *src, const char *dst, bool force, bool rewrite_vm)
+{
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, force);
+ else
+ return copyAndUpdateFile(pageConverter, src, dst, force);
+}
/*
* copyAndUpdateFile()
@@ -115,12 +156,14 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
*/
const char *
linkAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst)
+ const char *src, const char *dst, bool rewrite_vm)
{
if (pageConverter != NULL)
return "Cannot in-place update this cluster, page-by-page conversion is required";
- if (pg_link_file(src, dst) == -1)
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, true);
+ else if (pg_link_file(src, dst) == -1)
return getErrorText();
else
return NULL;
@@ -205,6 +248,96 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilitymap()
+ *
+ * Copies a visibility map file while adding all-frozen bit(0) into each bit.
+ */
+static const char *
+
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+ int src_fd = 0;
+ int dst_fd = 0;
+ char buffer[BLCKSZ];
+ ssize_t bytesRead;
+ int rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+ BlockNumber blkno = 0;
+
+ /* Reset errno */
+ errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return getErrorText();
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ /* Perform data rewriting per page */
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
+ char *cur, *end, *blkend;
+ PageHeaderData pageheader;
+ uint16 vm_bits;
+
+ /* Save the page header data */
+ memcpy(&pageheader, buffer, SizeOfPageHeaderData);
+
+ cur = buffer + SizeOfPageHeaderData;
+ end = buffer + SizeOfPageHeaderData + rewriteVmBytesPerPage;
+ blkend = buffer + bytesRead;
+
+ while (blkend >= end)
+ {
+ char vmbuf[BLCKSZ];
+ char *vmtmp = vmbuf;
+
+ /* Copy page header in advance */
+ memcpy(vmbuf, &pageheader, SizeOfPageHeaderData);
+
+ vmtmp += SizeOfPageHeaderData;
+
+ /* Rewrite visibility map bit one by one */
+ while (end > cur)
+ {
+ /* Write rewritten bit from table and its string representation */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+ memcpy(vmtmp, &vm_bits, BITS_PER_HEAPBLOCK);
+
+ cur++;
+ vmtmp += BITS_PER_HEAPBLOCK;
+ }
+
+ /* Set new checksum for a visibility map page, If enabled */
+ if (old_cluster.controldata.data_checksum_version != 0 &&
+ new_cluster.controldata.data_checksum_version != 0)
+ ((PageHeader) vmbuf)->pd_checksum = pg_checksum_page(vmbuf, blkno);
+
+ if (write(dst_fd, vmbuf, BLCKSZ) != BLCKSZ)
+ {
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ end += rewriteVmBytesPerPage;
+ blkno++;
+ }
+ }
+
+err:
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText();
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index a43dff5..c5ad9fb 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -110,6 +110,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of visibility map is changed with this 9.6 commit,
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201512021
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -394,10 +398,12 @@ const pageCnvCtx *setupPageConverter(void);
typedef void *pageCnvCtx;
#endif
+const char *copyOrRewriteFile(pageCnvCtx *pageConverter, const char *src,
+ const char *dst, bool force, bool rewrite_vm);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst);
+ const char *dst, bool rewrite_vm);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index bfde1b1..5992bda 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *suffix, bool vm_need_rewrite);
/*
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_need_rewrite = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,13 +181,20 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_need_rewrite = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(pageConverter, &maps[mapnum], "", false);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +202,14 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(pageConverter, &maps[mapnum], "_fsm", false);
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ {
+ if (vm_need_rewrite)
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", true);
+ else
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", false);
+ }
}
}
}
@@ -210,7 +223,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
*/
static void
transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+ const char *type_suffix, bool vm_need_rewrite)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -276,7 +289,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ if ((msg = copyOrRewriteFile(pageConverter, old_file, new_file, true, vm_need_rewrite)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -284,7 +297,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
+ if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file, vm_need_rewrite)) != NULL)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index aa7f399..6b058d4 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..93afb10 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..115c9b2 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,36 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, other for all-frozen.
+*/
+#define BITS_PER_HEAPBLOCK 2
+
+/* Number of heap blocks we can represent in one byte. */
+#define HEAPBLOCKS_PER_BYTE 4
+
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+#define VISIBILITYMAP_VALID_BITS 0x03 /* OR of all valid visiblitymap flags bits */
+
+/* Macros for visibilitymap test */
+#define VM_ALL_VISIBLE(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_VISIBLE) != 0)
+#define VM_ALL_FROZEN(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern BlockNumber visibilitymap_count(Relation rel, BlockNumber *all_frozen);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index eba4150..b1c300b 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201511071
+#define CATALOG_VERSION_NO 201512021
#endif
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index d8640db..9a77d7d 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2779,6 +2779,8 @@ DATA(insert OID = 3056 ( pg_stat_get_analyze_count PGNSP PGUID 12 1 0 0 0 f f f
DESCR("statistics: number of manual analyzes for a table");
DATA(insert OID = 3057 ( pg_stat_get_autoanalyze_count PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_autoanalyze_count _null_ _null_ _null_ ));
DESCR("statistics: number of auto analyzes for a table");
+DATA(insert OID = 6015 ( pg_stat_get_frozen_pages PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_frozen_pages _null_ _null_ _null_ ));
+DESCR("statistics: number of frozen pages of table");
DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 0 f f f f t t s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_idset _null_ _null_ _null_ ));
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f f f t s r 1 0 2249 "23" "{23,26,23,26,25,25,25,16,1184,1184,1184,1184,869,25,23,28,28,16,25,25,23,16,25}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,pid,usesysid,application_name,state,query,waiting,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,ssl,sslversion,sslcipher,sslbits,sslcompression,sslclientdn}" _null_ _null_ pg_stat_get_activity _null_ _null_ _null_ ));
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9ecc163..ed784bc 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -17,6 +17,7 @@
#include "portability/instr_time.h"
#include "postmaster/pgarch.h"
#include "storage/barrier.h"
+#include "storage/block.h"
#include "utils/hsearch.h"
#include "utils/relcache.h"
@@ -355,6 +356,7 @@ typedef struct PgStat_MsgVacuum
TimestampTz m_vacuumtime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ BlockNumber m_frozen_pages;
} PgStat_MsgVacuum;
@@ -372,6 +374,7 @@ typedef struct PgStat_MsgAnalyze
TimestampTz m_analyzetime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgAnalyze;
@@ -551,7 +554,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9E
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -615,6 +618,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter n_dead_tuples;
PgStat_Counter changes_since_analyze;
+ int32 n_frozen_pages;
+
PgStat_Counter blocks_fetched;
PgStat_Counter blocks_hit;
@@ -917,9 +922,11 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type t
extern void pgstat_report_autovac(Oid dboid);
extern void pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_recovery_conflict(int reason);
extern void pgstat_report_deadlock(void);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..102aa81 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -367,7 +369,12 @@ typedef PageHeaderData *PageHeader;
#define PageSetAllVisible(page) \
(((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
#define PageClearAllVisible(page) \
- (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+ (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
+
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
#define PageIsPrunable(page, oldestxmin) \
( \
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 80374e4..b259e65 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1683,6 +1683,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(c.oid) AS n_frozen_page,
pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid) AS last_autovacuum,
pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
@@ -1795,6 +1796,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
@@ -1838,6 +1840,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_page,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index f5be70f..dd49786 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -127,6 +127,8 @@ SELECT count(*) FROM tenk2 WHERE unique1 = 1;
1
(1 row)
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
SELECT pg_sleep(1.0);
@@ -175,6 +177,14 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
t | t
(1 row)
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+ ?column?
+----------
+ t
+(1 row)
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
snapshot_newer
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..3be0354
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,22 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+-- All pages are become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 45 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+\set VERBOSITY default
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index b1bc7c7..e31fa76 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -110,3 +110,6 @@ test: event_trigger
# run stats by itself because its delay may be insufficient under heavy load
test: stats
+
+# page info map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index ade9ef1..666e40c 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -161,3 +161,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index cd2d592..b3c640f 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -120,6 +120,8 @@ ROLLBACK;
SELECT count(*) FROM tenk2;
-- do an indexscan
SELECT count(*) FROM tenk2 WHERE unique1 = 1;
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
@@ -145,6 +147,10 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
WHERE st.relname='tenk2' AND cl.relname='tenk2';
+SELECT n_frozen_page = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..365570b
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,16 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+-- All pages are become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+\set VERBOSITY default
+
+DROP TABLE vmtest;
On Tue, Dec 1, 2015 at 10:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Dec 1, 2015 at 3:04 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Nov 30, 2015 at 9:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Yeah, we need to consider to compute checksum if enabled.
I've changed the patch, and attached.
Please review it.
Thanks for the update. This now conflicts with the updates done to
fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
conflict in order to do some testing, but I'd like to get an updated
patch from the author in case I did it wrong. I don't want to find
bugs that I just introduced myself.
Thank you for having a look.
Attached updated v28 patch.
Please review it.
Regards,
After running pg_upgrade, if I manually vacuum a table I start getting warnings:
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32756
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32756
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32757
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32757
The warnings are right where the blocks would start using the 2nd page
of the _vm, so I think the problem is there. And looking at the code,
I think that "cur += SizeOfPageHeaderData;" in the inner loop cannot
be correct. We can't skip a header in the current (old) block each
time we reach the end of the new block. The thing we are skipping in
the current block is half the time not a header, but the data at the
halfway point through the block.
Thank you for reviewing.
You're right, it's not necessary.
Attached latest v29 patch which removes the mention in pg_upgrade documentation.
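As an aside, the per-byte expansion that the patch's rewrite_vm_table encodes
can be spelled out as a minimal sketch (an illustration only, not code from the
patch; plain <stdint.h> types stand in for the patch's uint8/uint16, and
expand_vm_byte is an assumed name):

#include <stdint.h>

/*
 * Expand one old-format visibility map byte (8 heap blocks, one all-visible
 * bit each) into the new 2-bits-per-block format.  Old bit i becomes new bit
 * 2*i (all-visible); the new all-frozen bit 2*i+1 is left clear, which is
 * exactly what each entry of rewrite_vm_table holds.
 */
static uint16_t
expand_vm_byte(uint8_t oldbyte)
{
    uint16_t    newbits = 0;
    int         i;

    for (i = 0; i < 8; i++)
    {
        if (oldbyte & (1 << i))
            newbits |= (uint16_t) 1 << (2 * i);
    }
    return newbits;
}

Since every old data byte becomes two new bytes, one old map page fills exactly
two new pages, each with a freshly written header; the old page's header only
needs to be skipped once per old page, which is what the fix above amounts to.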
I could successfully upgrade with this patch, with the link option and
without. After the update the tables seemed to have their correct
visibility status, and after a VACUUM FREEZE then had the correct
freeze status as well.
Then I manually corrupted the vm file, just to make sure a corrupted
one would get detected. And much to my surprise, I didn't get any
errors or warning when starting it back up and running vacuum freeze
(unless I had page checksums turned on, then I got warnings and zeroed
out pages). But I guess this is not considered a warnable condition
for bits to be off when they should be on, only the opposite.
Consecutive VACUUM FREEZE operations with no DML activity between were
not sped up by as much as I thought they would be, because it still
had to walk all the indexes even though it didn't touch the table at
all. In real-world usage there would almost always be some dead
tuples that would require an index scan anyway for a normal vacuum.
Cheers,
Jeff
On Fri, Dec 4, 2015 at 9:51 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Tue, Dec 1, 2015 at 10:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Dec 1, 2015 at 3:04 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Nov 30, 2015 at 9:18 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Yeah, we need to consider to compute checksum if enabled.
I've changed the patch, and attached.
Please review it.
Thanks for the update. This now conflicts with the updates done to
fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
conflict in order to do some testing, but I'd like to get an updated
patch from the author in case I did it wrong. I don't want to find
bugs that I just introduced myself.
Thank you for having a look.
Attached updated v28 patch.
Please review it.
Regards,
After running pg_upgrade, if I manually vacuum a table I start getting warnings:
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32756
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32756
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32757
WARNING: page is not marked all-visible (and all-frozen) but
visibility map bit(s) is set in relation "foo" page 32757
The warnings are right where the blocks would start using the 2nd page
of the _vm, so I think the problem is there. And looking at the code,
I think that "cur += SizeOfPageHeaderData;" in the inner loop cannot
be correct. We can't skip a header in the current (old) block each
time we reach the end of the new block. The thing we are skipping in
the current block is half the time not a header, but the data at the
halfway point through the block.
Thank you for reviewing.
You're right, it's not necessary.
Attached latest v29 patch which removes the mention in pg_upgrade documentation.
I could successfully upgrade with this patch, with the link option and
without. After the update the tables seemed to have their correct
visibility status, and after a VACUUM FREEZE then had the correct
freeze status as well.
Thank you for testing!
Then I manually corrupted the vm file, just to make sure a corrupted
one would get detected. And much to my surprise, I didn't get any
errors or warning when starting it back up and running vacuum freeze
(unless I had page checksums turned on, then I got warnings and zeroed
out pages). But I guess this is not considered a warnable condition
for bits to be off when they should be on, only the opposite.
How did you break the vm file?
The inconsistent flags state (all-frozen set but all-visible not set)
will be detected by the visibility map code. But the vm file has
consecutive bits simply after its page header, so detecting its
corruption would be difficult unless the whole page is corrupted.
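To sketch the check being referred to (illustration only: the flag names and
visibilitymap_get_status come from the patch, while check_vm_flags and its
arguments are assumed context), a block whose all-frozen bit is set while its
all-visible bit is clear is an impossible combination and can be warned about,
whereas a bit that was silently cleared is indistinguishable from a valid map:

#include "postgres.h"
#include "access/visibilitymap.h"
#include "utils/rel.h"

static void
check_vm_flags(Relation rel, BlockNumber blkno, Buffer *vmbuf)
{
    /* Read both bits for this heap block from the visibility map. */
    uint8       status = visibilitymap_get_status(rel, blkno, vmbuf);

    /* Frozen-but-not-visible is the only state that can be proven wrong. */
    if ((status & VISIBILITYMAP_ALL_FROZEN) != 0 &&
        (status & VISIBILITYMAP_ALL_VISIBLE) == 0)
        elog(WARNING, "\"%s\" block %u is marked all-frozen but not all-visible",
             RelationGetRelationName(rel), blkno);
}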
Consecutive VACUUM FREEZE operations with no DML activity between were
not sped up by as much as I thought they would be, because it still
had to walk all the indexes even though it didn't touch the table at
all. In real-world usage there would almost always be some dead
tuples that would require an index scan anyway for a normal vacuum.
Another reason consecutive VACUUM FREEZE runs were not sped up much could be
that many pages of that table were already in the disk cache, right?
In a very large database, vacuuming the large tables accounts for most of the
total vacuum time, so the optimization would be more effective there.
Regards,
--
Masahiko Sawada
On Mon, Nov 30, 2015 at 12:58 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Yeah, we need to consider to compute checksum if enabled.
I've changed the patch, and attached.
Please review it.
Thanks for the update. This now conflicts with the updates done to
fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
conflict in order to do some testing, but I'd like to get an updated
patch from the author in case I did it wrong. I don't want to find
bugs that I just introduced myself.
Thank you for having a look.
I would not bother mentioning this detail in the pg_upgrade manual page:
+ Since the format of visibility map has been changed in version 9.6,
+ <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
+ file even if upgrading from 9.5 or before to 9.6 or later with link mode (-k).
Really? I know we don't always document things like this, but it
seems like a good idea to me that we do so.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Dec 10, 2015 at 3:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Nov 30, 2015 at 12:58 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Yeah, we need to consider to compute checksum if enabled.
I've changed the patch, and attached.
Please review it.
Thanks for the update. This now conflicts with the updates done to
fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
conflict in order to do some testing, but I'd like to get an updated
patch from the author in case I did it wrong. I don't want to find
bugs that I just introduced myself.
Thank you for having a look.
I would not bother mentioning this detail in the pg_upgrade manual page:
+ Since the format of visibility map has been changed in version 9.6,
+ <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
+ file even if upgrading from 9.5 or before to 9.6 or later with link mode (-k).
Really? I know we don't always document things like this, but it
seems like a good idea to me that we do so.
Just going through v30...
+ frozen. The whole-table freezing is occuerred only when all pages happen to
+ require freezing to freeze rows. In other cases such as where
I am not really getting the meaning of this sentence. Shouldn't this
be reworded something like:
"Freezing occurs on the whole table once all pages of this relation require it."
+ <structfield>relfrozenxid</> is more than
<varname>vacuum_freeze_table_age</>
+ transcations old, where <command>VACUUM</>'s <literal>FREEZE</>
option is used,
+ <command>VACUUM</> can skip the pages that all tuples on the page
itself are
+ marked as frozen.
+ When all pages of table are eventually marked as frozen by
<command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transcations started since the <command>VACUUM</> started).
+ If the advancing of <structfield>relfrozenxid</> is not happend until
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
s/transcations/transactions.
+ <entry><structfield>n_frozen_page</></entry>
+ <entry><type>integer</></entry>
+ <entry>Number of frozen pages</entry>
n_frozen_pages?
make check with pg_upgrade is failing for me:
Visibility map rewriting test failed
make: *** [check] Error 1
-GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
+GetVisibilitymapPins(Relation relation, Buffer buffer1, Buffer buffer2,
This looks like an unrelated change.
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. And all-frozen bit must be
+ * cleared with all-visible at the same time.
This could be reformulated. This is just an addition on top of the
existing paragraph.
+ * The visibility map has the all-frozen bit which indicates all tuples on
+ * corresponding page has been completely frozen, so the visibility map is also
"have been completely frozen".
-/* Mapping from heap block number to the right bit in the visibility map */
-#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
-#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) /
HEAPBLOCKS_PER_BYTE)
Those two declarations are just noise in the patch: those definitions
are unchanged.
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s block %d",
RelationGetRelationName(rel), heapBlk);
This may be better as a separate patch.
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, BlockNumber *all_frozen)
I think that this routine would gain in clarity if reworked as follows:
visibilitymap_count(Relation rel, BlockNumber *all_visible,
BlockNumber *all_frozen)
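Under that reworked signature a caller would look roughly like this sketch
(onerel and the two local counters are assumed context, not code from the
patch), with both counters coming back the same way instead of one through
the return value:

    BlockNumber relallvisible = 0;
    BlockNumber relallfrozen = 0;

    /* Sketch only: both page counts are returned through out parameters. */
    visibilitymap_count(onerel, &relallvisible, &relallfrozen);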
+ /*
+ * Report ANALYZE to the stats collector, too.
However, if doing
+ * inherited stats we shouldn't report, because the
stats collector only
+ * tracks per-table stats.
+ */
+ if (!inh)
+ pgstat_report_analyze(onerel, totalrows,
totaldeadrows, relallfrozen);
Here we already know that this is working in the non-inherited code
path. As a whole, why that? Why isn't relallfrozen passed as an
argument of vac_update_relstats and then inserted in pg_class? Maybe I
am missing something..
+ * mxid full-table scan limit. During full scan, we could skip some pags
+ * according to all-frozen bit of visibility map.
s/pags/pages
+ * Also, skipping even a single page accorinding to all-visible bit of
s/accorinding/according.
So, lazy_scan_heap is the central and really vital portion of the patch...
+ /* Check whether this tuple is alrady
frozen or not */
s/alrady/already
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId
*visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId
*visibility_cutoff_xid,
+ bool *all_frozen)
I think you would want to change that to heap_page_visible_status that
returns *all_visible as well. At least it seems to be a more
consistent style of routine.
+ * The format of visibility map is changed with this 9.6 commit,
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201512021
It looks a bit strange to have a different flag for the vm with the
new frozen bit. Couldn't we merge that into a unique version number? I
guess that we should just ask for a vm rewrite anyway in any case if
pg_upgrade is used on the version of pg with the new vm format, no?
Sawada-san, are you planning to continue working on that? At this
stage it seems that this patch is not in committable state and should
at best be moved to next CF, or at worst returned with feedback.
--
Michael
On 9 December 2015 at 18:31, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Nov 30, 2015 at 12:58 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com>
wrote:
On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <
sawada.mshk@gmail.com> wrote:
Yeah, we need to consider to compute checksum if enabled.
I've changed the patch, and attached.
Please review it.
Thanks for the update. This now conflicts with the updates done to
fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
conflict in order to do some testing, but I'd like to get an updated
patch from the author in case I did it wrong. I don't want to find
bugs that I just introduced myself.
Thank you for having a look.
I would not bother mentioning this detail in the pg_upgrade manual page:
+ Since the format of visibility map has been changed in version 9.6,
+ <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
+ file even if upgrading from 9.5 or before to 9.6 or later with link
mode (-k).
Really? I know we don't always document things like this, but it
seems like a good idea to me that we do so.
Agreed.
For me, rewriting the visibility map is a new data loss bug waiting to
happen. I am worried that the group is not taking seriously the potential
for catastrophe here. I think we can do it, but I think it needs these
things
* Clear notice that it is happening unconditionally and unavoidably
* Log files showing it has happened, action by action
* Very clear mechanism for resolving an incomplete or interrupted upgrade
process. Which VMs got upgraded? Which didn't?
* Ability to undo an upgrade attempt, somehow, ideally automatically by
default
* Ability to restart a failed upgrade attempt without doing a "double
upgrade", i.e. ensure transformation is immutable
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Dec 17, 2015 at 3:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
For me, rewriting the visibility map is a new data loss bug waiting to
happen. I am worried that the group is not taking seriously the potential
for catastrophe here.
FWIW, I'm following this line and merging the vm file into a single
unit looks like a ticking bomb. We may really want a separate _vm
file, like _vmf to track this new bit flag but this has already been
mentioned largely upthread...
--
Michael
On 2015-12-17 15:56:35 +0900, Michael Paquier wrote:
On Thu, Dec 17, 2015 at 3:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
For me, rewriting the visibility map is a new data loss bug waiting to
happen. I am worried that the group is not taking seriously the potential
for catastrophe here.
FWIW, I'm following this line and merging the vm file into a single
unit looks like a ticking bomb.
And what are those risks?
We may really want a separate _vm file, like _vmf to track this new
bit flag but this has already been mentioned largely upthread...
That'd double the overhead when those bits get unset.
On Thu, Dec 17, 2015 at 4:10 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-12-17 15:56:35 +0900, Michael Paquier wrote:
On Thu, Dec 17, 2015 at 3:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
For me, rewriting the visibility map is a new data loss bug waiting to
happen. I am worried that the group is not taking seriously the potential
for catastrophe here.
FWIW, I'm following this line and merging the vm file into a single
unit looks like a ticking bomb.
And what are those risks?
Incorrect vm file rewrite after a pg_upgrade run.
--
Michael
On 2015-12-17 16:22:24 +0900, Michael Paquier wrote:
On Thu, Dec 17, 2015 at 4:10 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-12-17 15:56:35 +0900, Michael Paquier wrote:
On Thu, Dec 17, 2015 at 3:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
For me, rewriting the visibility map is a new data loss bug waiting to
happen. I am worried that the group is not taking seriously the potential
for catastrophe here.
FWIW, I'm following this line and merging the vm file into a single
unit looks like a ticking bomb.
And what are those risks?
Incorrect vm file rewrite after a pg_upgrade run.
If we can't manage to rewrite a file, replacing a binary b1 with a b10,
then we shouldn't be working on a database. And if we screw up, recovery
is an rm *_vm away. I can't imagine that this is going to be the
actually complicated part of this feature.
On Thu, Dec 17, 2015 at 11:47 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Thu, Dec 10, 2015 at 3:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Nov 30, 2015 at 12:58 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Yeah, we need to consider to compute checksum if enabled.
I've changed the patch, and attached.
Please review it.
Thanks for the update. This now conflicts with the updates done to
fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
conflict in order to do some testing, but I'd like to get an updated
patch from the author in case I did it wrong. I don't want to find
bugs that I just introduced myself.
Thank you for having a look.
I would not bother mentioning this detail in the pg_upgrade manual page:
+ Since the format of visibility map has been changed in version 9.6,
+ <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
+ file even if upgrading from 9.5 or before to 9.6 or later with link mode (-k).
Really? I know we don't always document things like this, but it
seems like a good idea to me that we do so.
Just going through v30...
+ frozen. The whole-table freezing is occuerred only when all pages happen to
+ require freezing to freeze rows. In other cases such as where
I am not really getting the meaning of this sentence. Shouldn't this
be reworded something like:
"Freezing occurs on the whole table once all pages of this relation require it."
+ <structfield>relfrozenxid</> is more than <varname>vacuum_freeze_table_age</>
+ transcations old, where <command>VACUUM</>'s <literal>FREEZE</> option is used,
+ <command>VACUUM</> can skip the pages that all tuples on the page itself are
+ marked as frozen.
+ When all pages of table are eventually marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transcations started since the <command>VACUUM</> started).
+ If the advancing of <structfield>relfrozenxid</> is not happend until
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
s/transcations/transactions.
+ <entry><structfield>n_frozen_page</></entry>
+ <entry><type>integer</></entry>
+ <entry>Number of frozen pages</entry>
n_frozen_pages?
make check with pg_upgrade is failing for me:
Visibility map rewriting test failed
make: *** [check] Error 1
make check with pg_upgrade completes successfully in my environment.
Could you give me more information about this?
-GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
+GetVisibilitymapPins(Relation relation, Buffer buffer1, Buffer buffer2,
This looks like an unrelated change.
* Clearing a visibility map bit is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
- * replay of the updating operation as well.
+ * replay of the updating operation as well. And all-frozen bit must be
+ * cleared with all-visible at the same time.
This could be reformulated. This is just an addition on top of the
existing paragraph.
+ * The visibility map has the all-frozen bit which indicates all tuples on
+ * corresponding page has been completely frozen, so the visibility map is also
"have been completely frozen".
-/* Mapping from heap block number to the right bit in the visibility map */
-#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
-#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) /
HEAPBLOCKS_PER_BYTE)
Those two declarations are just noise in the patch: those definitions
are unchanged.
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s block %d", RelationGetRelationName(rel), heapBlk);
This may be better as a separate patch.
I've attached a separate 001 patch for this.
-visibilitymap_count(Relation rel)
+visibilitymap_count(Relation rel, BlockNumber *all_frozen)
I think that this routine would gain in clarity if reworked as follows:
visibilitymap_count(Relation rel, BlockNumber *all_visible,
BlockNumber *all_frozen)
+ /*
+ * Report ANALYZE to the stats collector, too. However, if doing
+ * inherited stats we shouldn't report, because the stats collector only
+ * tracks per-table stats.
+ */
+ if (!inh)
+ pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
Here we already know that this is working in the non-inherited code
path. As a whole, why that? Why isn't relallfrozen passed as an
argument of vac_update_relstats and then inserted in pg_class? Maybe I
am missing something..
IIUC, as per discussion, the number of frozen pages should not be
inserted into pg_class, because it's not information used by query
planning, unlike relallvisible and relpages.
+ * mxid full-table scan limit. During full scan, we could skip some pags
+ * according to all-frozen bit of visibility map.
s/pags/pages
+ * Also, skipping even a single page accorinding to all-visible bit of
s/accorinding/according.
So, lazy_scan_heap is the central and really vital portion of the patch...
+ /* Check whether this tuple is alrady
frozen or not */
s/alrady/already
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_frozen)
I think you would want to change that to heap_page_visible_status that
returns *all_visible as well. At least it seems to be a more
consistent style of routine.
+ * The format of visibility map is changed with this 9.6 commit,
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201512021
It looks a bit strange to have a different flag for the vm with the
new frozen bit. Couldn't we merge that into a unique version number? I
guess that we should just ask for a vm rewrite anyway in any case if
pg_upgrade is used on the version of pg with the new vm format, no?
Thank you for your review.
Please find the attached latest v31 patches.
Sawada-san, are you planning to continue working on that? At this
stage it seems that this patch is not in committable state and should
at best be moved to next CF, or at worst returned with feedback.
Yes, of course.
This patch should be marked as "Move to next CF".
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v31.patchapplication/octet-stream; name=000_add_frozen_bit_into_visibilitymap_v31.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..e8ebfe9 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 392eb70..c43443a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5916,7 +5916,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs an eager freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -5960,7 +5960,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs an eager freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..7cc975d 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and pages contain only frozen tuples.
+ This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -438,28 +438,25 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows. To ensure all old
+ row versions have been frozen, a scan of all unfrozen pages is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: a whole-table freezing is forced if
+ the table hasn't been ensured all row versions are frozen for
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transcation.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
- that, data loss could result. To ensure that this does not happen,
- autovacuum is invoked on any table that might contain unfrozen rows with
- XIDs older than the age specified by the configuration parameter <xref
- linkend="guc-autovacuum-freeze-max-age">. (This will happen even if
- autovacuum is disabled.)
+ the time <command>VACUUM</> last scanned unfrozen pages. If it were to go
+ unvacuumed for longer than that, data loss could result. To ensure
+ that this does not happen, autovacuum is invoked on any table that might
+ contain unfrozen rows with XIDs older than the age specified by the
+ configuration parameter <xref linkend="guc-autovacuum-freeze-max-age">.
+ (This will happen even if autovacuum is disabled.)
</para>
<para>
@@ -490,8 +487,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +523,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +551,18 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. The whole table is considered frozen once every page of the
+ relation has been marked all-frozen. Even when a freezing scan is forced,
+ such as when <structfield>relfrozenxid</> is more than
+ <varname>vacuum_freeze_table_age</> transactions old or when
+ <command>VACUUM</>'s <literal>FREEZE</> option is used, <command>VACUUM</>
+ can skip pages on which all tuples are already marked as frozen.
+ When all pages of the table are eventually marked as frozen by
+ <command>VACUUM</>, after it's finished <literal>age(relfrozenxid)</>
+ should be a little more than the <varname>vacuum_freeze_min_age</> setting
+ that was used (more by the number of transactions started since the
+ <command>VACUUM</> started). If <structfield>relfrozenxid</> has not been
+ advanced by the time <varname>autovacuum_freeze_max_age</> is reached,
+ an autovacuum will soon be forced for the table.
</para>
<para>
@@ -642,28 +639,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ <command>VACUUM</> scans that cover all unfrozen pages, regardless of what
+ causes them, enable advancing the value for that table.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Such
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +740,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c503636..d3ecc38 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1332,6 +1332,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Estimated number of rows modified since this table was last analyzed</entry>
</row>
<row>
+ <entry><structfield>n_frozen_pages</></entry>
+ <entry><type>integer</></entry>
+ <entry>Number of frozen pages</entry>
+ </row>
+ <row>
<entry><structfield>last_vacuum</></entry>
<entry><type>timestamp with time zone</></entry>
<entry>Last time at which this table was manually vacuumed
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..5dc8b04 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -623,18 +623,20 @@ can be used to examine the information stored in free space maps.
<para>
Each heap relation has a Visibility Map
(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
+visible to all active transactions, and which pages contain only frozen tuples.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_vm</> suffix.
+For example, if the filenode of a relation is 12345, the VM is stored in a file
+called <filename>12345_vm</>, in the same directory as the main relation file.
Note that indexes do not have VMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The visibility map stores two bits per heap page: all-visible and all-frozen.
+A set all-visible bit means that all tuples on the page are known to be visible
+to all transactions. A set all-frozen bit means that all tuples on the page are
+completely frozen, so the page does not contain any tuples that need to be
+vacuumed, even when a scan of the whole table is otherwise required.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9ff7a41..651dd0e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3034,9 +3034,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -7205,7 +7205,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -7215,6 +7215,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7804,7 +7805,12 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7856,7 +7862,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 7c38772..6186caf 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -15,39 +15,45 @@
* visibilitymap_pin - pin a map page for setting a bit
* visibilitymap_pin_ok - check whether correct map page is already pinned
* visibilitymap_set - set a bit in a previously pinned page
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
* visibilitymap_count - count number of bits set in visibility map
* visibilitymap_truncate - truncate the visibility map
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if whole table scanning vacuum is required (e.g. anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible.
*
- * Clearing a visibility map bit is not separately WAL-logged. The callers
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
+ *
+ * Clearing both visibility map bits is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
* replay of the updating operation as well.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is all-visible or
+ * all-frozen; we just don't know that for certain. The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility map
+ * bit. If a crash occurs after the visibility map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
+ * The visibility map also has an all-frozen bit, which indicates that all
+ * tuples on the corresponding page have been completely frozen; therefore the
+ * visibility map can be used even for an anti-wraparound vacuum, when
+ * freezing of tuples is required.
*
* LOCKING
*
@@ -58,14 +64,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,38 +107,50 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
-
-/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
-
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
/* Mapping from heap block number to the right bit in the visibility map */
#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
-#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
+
+/* tables for fast counting of set bits for visible and freeze */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
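+
+/*
+ * Illustration of the encoding assumed by the tables above (with
+ * BITS_PER_HEAPBLOCK presumably defined as 2 in visibilitymap.h, the
+ * all-visible bit in the even position and the all-frozen bit in the odd
+ * position of each pair): each map byte covers four heap blocks.  For
+ * example, the byte 0x07 (binary 00000111) says that heap block 0 is
+ * all-visible and all-frozen, heap block 1 is all-visible only, and heap
+ * blocks 2 and 3 are neither; accordingly number_of_ones_for_visible[0x07]
+ * is 2 and number_of_ones_for_frozen[0x07] is 1.
+ */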
/* prototypes for internal routines */
@@ -141,7 +159,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits for one page in the visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,7 +171,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = VISIBILITYMAP_VALID_BITS << mapBit;
char *map;
#ifdef TRACE_VISIBILITYMAP
@@ -186,7 +204,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
* visibilitymap_set to actually set the bit.
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk.
*
@@ -212,7 +230,7 @@ visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
* visibilitymap_pin_ok - do we already have the correct page pinned?
*
* On entry, buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. The return value indicates whether the buffer covers the
* given heapBlk.
*/
@@ -225,7 +243,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +252,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and the flags indicating which bit(s) to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,13 +264,14 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
Page page;
- char *map;
+ uint8 *map;
#ifdef TRACE_VISIBILITYMAP
elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -259,6 +279,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
Assert(InRecovery || BufferIsValid(heapBuf));
+ Assert(flags & VISIBILITYMAP_VALID_BITS);
/* Check that we have the right heap page pinned, if present */
if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -269,14 +290,14 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
elog(ERROR, "wrong VM buffer passed to visibilitymap_set");
page = BufferGetPage(vmBuf);
- map = PageGetContents(page);
+ map = (uint8 *)PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] >> mapBit & VISIBILITYMAP_VALID_BITS))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << mapBit);
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +306,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +316,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(((flags & VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
+ ((flags & VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +336,17 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible to all, or all marked frozen, according
+ * to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
- * earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits. The returned
+ * status can be tested against VISIBILITYMAP_ALL_VISIBLE and VISIBILITYMAP_ALL_FROZEN.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -327,17 +355,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* we might see the old value. It is the caller's responsibility to deal with
* all concurrency issues!
*/
-bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+uint8
+visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- bool result;
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_get_status %s %d", RelationGetRelationName(rel), heapBlk);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,13 +387,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A read of the two bits within one byte is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
-
- return result;
+ return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
}
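+
+/*
+ * The callers in vacuumlazy.c and nodeIndexonlyscan.c test these bits via
+ * the VM_ALL_VISIBLE and VM_ALL_FROZEN convenience macros, assumed here to
+ * be defined in visibilitymap.h as thin wrappers around this function,
+ * along the lines of
+ *   #define VM_ALL_VISIBLE(r, b, v) \
+ *     ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_VISIBLE) != 0)
+ *   #define VM_ALL_FROZEN(r, b, v) \
+ *     ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
+ */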
/*
@@ -374,14 +399,21 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * The caller may pass NULL for all_frozen if the all-frozen count is not needed.
*/
-BlockNumber
-visibilitymap_count(Relation rel)
+void
+visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)
{
- BlockNumber result = 0;
BlockNumber mapBlock;
+ /* all_visible must be specified */
+ Assert(all_visible);
+
+ *all_visible = 0;
+ if (all_frozen)
+ *all_frozen = 0;
+
for (mapBlock = 0;; mapBlock++)
{
Buffer mapBuffer;
@@ -406,13 +438,13 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ *all_visible += number_of_ones_for_visible[map[i]];
+ if (all_frozen)
+ *all_frozen += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
}
-
- return result;
}
/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c10be3d..071992d 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1923,7 +1923,7 @@ index_update_stats(Relation rel,
BlockNumber relallvisible;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ visibilitymap_count(rel, &relallvisible, NULL);
else /* don't bother for indexes */
relallvisible = 0;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ccc030f..727d2a4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -444,6 +444,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(C.oid) AS n_frozen_pages,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ddb68ab..2b2fd0d 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -314,6 +314,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ BlockNumber relallvisible,
+ relallfrozen;
if (inh)
ereport(elevel,
@@ -564,51 +566,56 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
}
}
- /*
- * Update pages/tuples stats in pg_class ... but not if we're doing
- * inherited stats.
- */
if (!inh)
+ {
+ /* Count the number of all-visible and all-frozen bits */
+ visibilitymap_count(onerel, &relallvisible, &relallfrozen);
+
+ /*
+ * Update pages/tuples stats in pg_class ... but not if we're doing
+ * inherited stats.
+ */
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ relallvisible,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
in_outer_xact);
- /*
- * Same for indexes. Vacuum always scans all indexes, so if we're part of
- * VACUUM ANALYZE, don't overwrite the accurate count already inserted by
- * VACUUM.
- */
- if (!inh && !(options & VACOPT_VACUUM))
- {
- for (ind = 0; ind < nindexes; ind++)
+ /*
+ * Same for indexes. Vacuum always scans all indexes, so if we're part of
+ * VACUUM ANALYZE, don't overwrite the accurate count already inserted by
+ * VACUUM.
+ */
+ if (!(options & VACOPT_VACUUM))
{
- AnlIndexData *thisdata = &indexdata[ind];
- double totalindexrows;
-
- totalindexrows = ceil(thisdata->tupleFract * totalrows);
- vac_update_relstats(Irel[ind],
- RelationGetNumberOfBlocks(Irel[ind]),
- totalindexrows,
- 0,
- false,
- InvalidTransactionId,
- InvalidMultiXactId,
- in_outer_xact);
+ for (ind = 0; ind < nindexes; ind++)
+ {
+ AnlIndexData *thisdata = &indexdata[ind];
+ double totalindexrows;
+
+ totalindexrows = ceil(thisdata->tupleFract * totalrows);
+ vac_update_relstats(Irel[ind],
+ RelationGetNumberOfBlocks(Irel[ind]),
+ totalindexrows,
+ 0,
+ false,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ in_outer_xact);
+ }
}
- }
- /*
- * Report ANALYZE to the stats collector, too. However, if doing
- * inherited stats we shouldn't report, because the stats collector only
- * tracks per-table stats.
- */
- if (!inh)
- pgstat_report_analyze(onerel, totalrows, totaldeadrows);
+ /*
+ * Report ANALYZE to the stats collector, too. However, if doing
+ * inherited stats we shouldn't report, because the stats collector only
+ * tracks per-table stats.
+ */
+ pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
+
+ }
/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
if (!(options & VACOPT_VACUUM))
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2429889..8dafd1b 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped according to
+ the all-frozen bit of the visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -155,8 +157,9 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
ItemPointer itemptr);
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
-static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+static void heap_page_visible_status(Relation rel, Buffer buf,
+ TransactionId *visibility_cutoff_xid,
+ bool *all_visible, bool *all_frozen);
/*
@@ -188,7 +191,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -221,7 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* We request a full scan if either the table's frozen Xid is now older
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit.
+ * mxid full-table scan limit. Even during a full scan, we can skip pages
+ * whose all-frozen bit is set in the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +258,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -301,10 +307,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ visibilitymap_count(onerel, &new_rel_allvisible, &new_rel_allfrozen);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -325,7 +334,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),
onerel->rd_rel->relisshared,
new_live_tuples,
- vacrelstats->new_dead_tuples);
+ vacrelstats->new_dead_tuples,
+ new_rel_allfrozen);
/* and log the action if appropriate */
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
@@ -360,10 +370,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -486,9 +497,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page because of the all-visible bit of the
+ * visibility map means that we can't update relfrozenxid, so we only want
+ * to do it if we can skip a goodly number of pages. On the other hand, we
+ * count both how many pages we skip because of the all-frozen bit of the
+ * visibility map and how many pages we actually scan, so we can still
+ * update relfrozenxid if the sum of the two covers all pages of the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -498,24 +512,24 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we might not actually skip any pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
- * Note: The value returned by visibilitymap_test could be slightly
+ * Note: The value returned by visibilitymap_get_status could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*/
for (next_not_all_visible_block = 0;
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -533,9 +547,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples we freeze on this page */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page; /* # of live tuples on this page */
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -547,8 +565,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -562,14 +579,29 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
skipping_all_visible_blocks = true;
else
skipping_all_visible_blocks = false;
+
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all)
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We also check whether it is all-frozen, in which case we can skip
+ * vacuuming it even if scan_all is true.
+ */
+ bool all_frozen = VM_ALL_FROZEN(onerel, blkno, &vmbuffer);
+
+ if (scan_all && all_frozen)
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ else if (!scan_all && skipping_all_visible_blocks)
+ continue;
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -716,7 +748,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
empty_pages++;
freespace = PageGetHeapFreeSpace(page);
- /* empty pages are always all-visible */
+ /* empty pages are always all-visible and all-frozen */
if (!PageIsAllVisible(page))
{
START_CRIT_SECTION();
@@ -739,8 +771,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
log_newpage_buffer(buf, true);
PageSetAllVisible(page);
+ PageSetAllFrozen(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
END_CRIT_SECTION();
}
@@ -764,13 +798,15 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
/*
* Note: If you change anything in the loop below, also look at
- * heap_page_is_all_visible to see if that needs to be changed.
+ * heap_page_visible_status to see if that needs to be changed.
*/
for (offnum = FirstOffsetNumber;
offnum <= maxoff;
@@ -918,8 +954,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -966,6 +1007,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -988,26 +1032,46 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1018,9 +1082,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
{
- elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+ /* If the all-frozen bit is set then the all-visible bit must be set too */
+ if (all_frozen_according_to_vm)
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page is not marked all-visible (and all-frozen) but visibility map bit(s) is set in relation \"%s\" page %u",
relname, blkno);
visibilitymap_clear(onerel, blkno, vmbuffer);
}
@@ -1028,19 +1097,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/*
* It's possible for the value returned by GetOldestXmin() to move
* backwards, so it's not wrong for us to see tuples that appear to
- * not be visible to everyone yet, while PD_ALL_VISIBLE is already
- * set. The real safe xmin value never moves backwards, but
- * GetOldestXmin() is conservative and sometimes returns a value
+ * not be visible to everyone yet, while PD_ALL_VISIBLE (and PD_ALL_FROZEN)
+ * are already set. The real safe xmin value never moves backwards,
+ * but GetOldestXmin() is conservative and sometimes returns a value
* that's unnecessarily small, so if we see that contradiction it just
* means that the tuples that we think are not visible to everyone yet
- * actually are, and the PD_ALL_VISIBLE flag is correct.
+ * actually are, and the PD_ALL_VISIBLE (and PD_ALL_FROZEN) flags are
+ * correct.
*
* There should never be dead tuples on a page with PD_ALL_VISIBLE
* set, however.
*/
else if (PageIsAllVisible(page) && has_dead_tuples)
{
- elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+ /* If the all-frozen bit is set then the all-visible bit must be set too */
+ if (PageIsAllFrozen(page))
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page containing dead tuples is marked as all-visible (and all-frozen) in relation \"%s\" page %u",
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
@@ -1114,6 +1189,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to visibility map",
+ "skipped %d frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages,
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1230,6 +1312,8 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_visible;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1281,19 +1365,36 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ heap_page_visible_status(onerel, buffer, &visibility_cutoff_xid,
+ &all_visible, &all_frozen);
+ if (all_visible)
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if the page is all-frozen, set the PD_ALL_FROZEN flag and the VM
+ * all-frozen bit.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
+ uint8 vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer);
+ uint8 flags = 0;
+
+ if (!(vm_status & VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* Also set the VM all-frozen bit, if needed */
+ if (all_frozen && !(vm_status & VISIBILITYMAP_ALL_FROZEN))
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+
+ if (vm_status != flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1783,18 +1884,21 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and set *all_frozen if every tuple on
+ * this page is already frozen.
*/
-static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+static void
+heap_page_visible_status(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_visible, bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
OffsetNumber offnum,
maxoff;
- bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_visible = true;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1802,7 +1906,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
*/
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = FirstOffsetNumber;
- offnum <= maxoff && all_visible;
+ offnum <= maxoff && *all_visible;
offnum = OffsetNumberNext(offnum))
{
ItemId itemid;
@@ -1818,11 +1922,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible or frozen.
*/
if (ItemIdIsDead(itemid))
{
- all_visible = false;
+ *all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1841,7 +1946,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Check comments in lazy_scan_heap. */
if (!HeapTupleHeaderXminCommitted(tuple.t_data))
{
- all_visible = false;
+ *all_visible = false;
break;
}
@@ -1852,13 +1957,17 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
xmin = HeapTupleHeaderGetXmin(tuple.t_data);
if (!TransactionIdPrecedes(xmin, OldestXmin))
{
- all_visible = false;
+ *all_visible = false;
break;
}
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1866,7 +1975,8 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_RECENTLY_DEAD:
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
- all_visible = false;
+ *all_visible = false;
+ *all_frozen = false;
break;
default:
@@ -1875,5 +1985,6 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
} /* scan along page */
- return all_visible;
+ if (!(*all_visible))
+ *all_frozen = false;
}
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9f54c46..e9cf4c8 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -85,9 +85,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
* which all tuples are known visible to everybody. In any case,
* we'll use the index tuple not the heap tuple as the data source.
*
- * Note on Memory Ordering Effects: visibilitymap_test does not lock
- * the visibility map buffer, and therefore the result we read here
- * could be slightly stale. However, it can't be stale enough to
+ * Note on Memory Ordering Effects: visibilitymap_get_status does not
+ * lock the visibility map buffer, and therefore the result we read
+ * here could be slightly stale. However, it can't be stale enough to
* matter.
*
* We need to detect clearing a VM bit due to an insert right away,
@@ -114,9 +114,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
* It's worth going through this complexity to avoid needing to lock
* the VM buffer, which could cause significant contention.
*/
- if (!visibilitymap_test(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ if (!VM_ALL_VISIBLE(scandesc->heapRelation,
+ ItemPointerGetBlockNumber(tid),
+ &node->ioss_VMBuffer))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ab018c4..ca7257a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1329,7 +1329,8 @@ pgstat_report_autovac(Oid dboid)
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgVacuum msg;
@@ -1343,6 +1344,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -1354,7 +1356,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
*/
void
pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgAnalyze msg;
@@ -1394,6 +1397,7 @@ pgstat_report_analyze(Relation rel,
msg.m_analyzetime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -3702,6 +3706,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
+ result->n_frozen_pages = 0;
result->blocks_fetched = 0;
result->blocks_hit = 0;
result->vacuum_timestamp = 0;
@@ -5069,6 +5074,7 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
if (msg->m_autovacuum)
{
@@ -5103,6 +5109,7 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
/*
* We reset changes_since_analyze to zero, forgetting any changes that
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f7c9bf6..98c14f9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -46,6 +46,7 @@ extern Datum pg_stat_get_vacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autovacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_analyze_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_frozen_pages(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_calls(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_total_time(PG_FUNCTION_ARGS);
@@ -450,6 +451,21 @@ pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_frozen_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int32 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int32) (tabentry->n_frozen_pages);
+
+ PG_RETURN_INT32(result);
+}
+
+Datum
pg_stat_get_function_calls(PG_FUNCTION_ARGS)
{
Oid funcid = PG_GETARG_OID(0);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index c84783c..312dca6 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -9,7 +9,11 @@
#include "postgres_fe.h"
+#include "access/visibilitymap.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
#include <fcntl.h>
@@ -21,6 +25,43 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static const char *rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool force);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
+
+/*
+ * copyOrRewriteFile()
+ * This function copies a file, or rewrites a visibility map file.
+ * If rewrite_vm is true, we rewrite the visibility map regardless of the value of pageConverter.
+ */
+const char *
+copyOrRewriteFile(pageCnvCtx *pageConverter,
+ const char *src, const char *dst, bool force, bool rewrite_vm)
+{
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, force);
+ else
+ return copyAndUpdateFile(pageConverter, src, dst, force);
+}
/*
* copyAndUpdateFile()
@@ -115,12 +156,14 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
*/
const char *
linkAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst)
+ const char *src, const char *dst, bool rewrite_vm)
{
if (pageConverter != NULL)
return "Cannot in-place update this cluster, page-by-page conversion is required";
- if (pg_link_file(src, dst) == -1)
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, true);
+ else if (pg_link_file(src, dst) == -1)
return getErrorText();
else
return NULL;
@@ -205,6 +248,96 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilitymap()
+ *
+ * Copies a visibility map file while adding an all-frozen bit (initialized to 0) for each all-visible bit.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+ int src_fd = 0;
+ int dst_fd = 0;
+ char buffer[BLCKSZ];
+ ssize_t bytesRead;
+ int rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+ BlockNumber blkno = 0;
+
+ /* Reset errno */
+ errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return getErrorText();
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ /* Perform data rewriting per page */
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
+ char *cur, *end, *blkend;
+ PageHeaderData pageheader;
+ uint16 vm_bits;
+
+ /* Save the page header data */
+ memcpy(&pageheader, buffer, SizeOfPageHeaderData);
+
+ cur = buffer + SizeOfPageHeaderData;
+ end = buffer + SizeOfPageHeaderData + rewriteVmBytesPerPage;
+ blkend = buffer + bytesRead;
+
+ while (blkend >= end)
+ {
+ char vmbuf[BLCKSZ];
+ char *vmtmp = vmbuf;
+
+ /* Copy page header in advance */
+ memcpy(vmbuf, &pageheader, SizeOfPageHeaderData);
+
+ vmtmp += SizeOfPageHeaderData;
+
+ /* Rewrite visibility map bit one by one */
+ while (end > cur)
+ {
+ /* Write the rewritten bits looked up from the table */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+ memcpy(vmtmp, &vm_bits, BITS_PER_HEAPBLOCK);
+
+ cur++;
+ vmtmp += BITS_PER_HEAPBLOCK;
+ }
+
+ /* Set new checksum for a visibility map page, if enabled */
+ if (old_cluster.controldata.data_checksum_version != 0 &&
+ new_cluster.controldata.data_checksum_version != 0)
+ ((PageHeader) vmbuf)->pd_checksum = pg_checksum_page(vmbuf, blkno);
+
+ if (write(dst_fd, vmbuf, BLCKSZ) != BLCKSZ)
+ {
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ end += rewriteVmBytesPerPage;
+ blkno++;
+ }
+ }
+
+err:
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText();
+}
+
void
check_hard_link(void)
{
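
For reference, a standalone C sketch (not part of the patch) of how the
rewrite_vm_table above can be generated: bit i of an old-format map byte
(the all-visible flag of one heap block) moves to bit 2*i of a 16-bit
value, and the interleaved odd bits, which will hold the new all-frozen
flags, stay zero. rewriteVisibilitymap() streams every old byte through
this table, which is also why each old map page expands into two new pages.

#include <stdio.h>

/*
 * Sketch only: generate the 256-entry lookup table used above.
 * Each input bit i is moved to output bit 2*i; odd output bits
 * (the new all-frozen flags) are left at zero.
 */
int
main(void)
{
	int		in;

	for (in = 0; in < 256; in++)
	{
		unsigned int out = 0;
		int			bit;

		for (bit = 0; bit < 8; bit++)
			if (in & (1 << bit))
				out |= 1u << (2 * bit);

		printf("%u,%s", out, (in % 16 == 15) ? "\n" : " ");
	}
	return 0;
}
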
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index a43dff5..5a43c18 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -110,6 +110,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map was changed by the 9.6 commit that added the all-frozen bit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201512171
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -394,10 +398,12 @@ const pageCnvCtx *setupPageConverter(void);
typedef void *pageCnvCtx;
#endif
+const char *copyOrRewriteFile(pageCnvCtx *pageConverter, const char *src,
+ const char *dst, bool force, bool rewrite_vm);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst);
+ const char *dst, bool rewrite_vm);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index bfde1b1..5992bda 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *suffix, bool vm_need_rewrite);
/*
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_need_rewrite = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,13 +181,20 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_need_rewrite = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(pageConverter, &maps[mapnum], "", false);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +202,14 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(pageConverter, &maps[mapnum], "_fsm", false);
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ {
+ if (vm_need_rewrite)
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", true);
+ else
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", false);
+ }
}
}
}
@@ -210,7 +223,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
*/
static void
transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+ const char *type_suffix, bool vm_need_rewrite)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -276,7 +289,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ if ((msg = copyOrRewriteFile(pageConverter, old_file, new_file, true, vm_need_rewrite)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -284,7 +297,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
+ if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file, vm_need_rewrite)) != NULL)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index aa7f399..6b058d4 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index caa0f14..93afb10 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
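
As a side note, a small standalone sketch (not part of the patch) of why
SizeOfHeapVisible is spelled offsetof(xl_heap_visible, flags) + sizeof(uint8)
rather than sizeof(xl_heap_visible): the WAL payload must not include the
struct's trailing alignment padding after the new uint8 member.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;	/* stand-in for the real typedef */

typedef struct xl_heap_visible
{
	TransactionId cutoff_xid;
	uint8_t		flags;
} xl_heap_visible;

int
main(void)
{
	/* payload is 5 bytes; sizeof() typically reports 8 because of padding */
	printf("payload = %zu, sizeof = %zu\n",
		   offsetof(xl_heap_visible, flags) + sizeof(uint8_t),
		   sizeof(xl_heap_visible));
	return 0;
}
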
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 0c0e0ef..bb4e184 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,36 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
+
+/* Number of heap blocks we can represent in one byte. */
+#define HEAPBLOCKS_PER_BYTE 4
+
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+#define VISIBILITYMAP_VALID_BITS 0x03 /* OR of all valid visibilitymap flag bits */
+
+/* Macros for visibilitymap test */
+#define VM_ALL_VISIBLE(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_VISIBLE) != 0)
+#define VM_ALL_FROZEN(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
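
To make the new layout concrete, a standalone sketch assuming the macros
above (the helper name vm_status() is made up, not part of the patch): with
BITS_PER_HEAPBLOCK = 2 and HEAPBLOCKS_PER_BYTE = 4, a heap block's flags are
read by shifting its map byte right by (heapBlk % 4) * 2 and masking with
VISIBILITYMAP_VALID_BITS.

#include <stdint.h>
#include <stdio.h>

#define BITS_PER_HEAPBLOCK			2
#define HEAPBLOCKS_PER_BYTE			4
#define VISIBILITYMAP_ALL_VISIBLE	0x01
#define VISIBILITYMAP_ALL_FROZEN	0x02
#define VISIBILITYMAP_VALID_BITS	0x03

/* Hypothetical helper: extract one heap block's flags from a raw map byte. */
static uint8_t
vm_status(uint8_t map_byte, unsigned int heapBlk)
{
	unsigned int mapBit = (heapBlk % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK;

	return (map_byte >> mapBit) & VISIBILITYMAP_VALID_BITS;
}

int
main(void)
{
	uint8_t		map_byte = 0x07;	/* block 0: visible+frozen, block 1: visible only */
	unsigned int blk;

	for (blk = 0; blk < HEAPBLOCKS_PER_BYTE; blk++)
	{
		uint8_t		status = vm_status(map_byte, blk);

		printf("block %u: all-visible=%d all-frozen=%d\n", blk,
			   (status & VISIBILITYMAP_ALL_VISIBLE) != 0,
			   (status & VISIBILITYMAP_ALL_FROZEN) != 0);
	}
	return 0;
}
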
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index eba4150..30f3f79 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201511071
+#define CATALOG_VERSION_NO 201512171
#endif
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index d8640db..9a77d7d 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2779,6 +2779,8 @@ DATA(insert OID = 3056 ( pg_stat_get_analyze_count PGNSP PGUID 12 1 0 0 0 f f f
DESCR("statistics: number of manual analyzes for a table");
DATA(insert OID = 3057 ( pg_stat_get_autoanalyze_count PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_autoanalyze_count _null_ _null_ _null_ ));
DESCR("statistics: number of auto analyzes for a table");
+DATA(insert OID = 6015 ( pg_stat_get_frozen_pages PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_frozen_pages _null_ _null_ _null_ ));
+DESCR("statistics: number of frozen pages of table");
DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 0 f f f f t t s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_idset _null_ _null_ _null_ ));
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f f f t s r 1 0 2249 "23" "{23,26,23,26,25,25,25,16,1184,1184,1184,1184,869,25,23,28,28,16,25,25,23,16,25}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,pid,usesysid,application_name,state,query,waiting,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,ssl,sslversion,sslcipher,sslbits,sslcompression,sslclientdn}" _null_ _null_ pg_stat_get_activity _null_ _null_ _null_ ));
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9ecc163..ed784bc 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -17,6 +17,7 @@
#include "portability/instr_time.h"
#include "postmaster/pgarch.h"
#include "storage/barrier.h"
+#include "storage/block.h"
#include "utils/hsearch.h"
#include "utils/relcache.h"
@@ -355,6 +356,7 @@ typedef struct PgStat_MsgVacuum
TimestampTz m_vacuumtime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ BlockNumber m_frozen_pages;
} PgStat_MsgVacuum;
@@ -372,6 +374,7 @@ typedef struct PgStat_MsgAnalyze
TimestampTz m_analyzetime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgAnalyze;
@@ -551,7 +554,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9E
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -615,6 +618,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter n_dead_tuples;
PgStat_Counter changes_since_analyze;
+ int32 n_frozen_pages;
+
PgStat_Counter blocks_fetched;
PgStat_Counter blocks_hit;
@@ -917,9 +922,11 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type t
extern void pgstat_report_autovac(Oid dboid);
extern void pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_recovery_conflict(int reason);
extern void pgstat_report_deadlock(void);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index a2f78ee..102aa81 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -367,7 +369,12 @@ typedef PageHeaderData *PageHeader;
#define PageSetAllVisible(page) \
(((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
#define PageClearAllVisible(page) \
- (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+ (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
+
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
#define PageIsPrunable(page, oldestxmin) \
( \
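
A tiny standalone sketch (not part of the patch) of the invariant encoded
above: PageClearAllVisible() now clears PD_ALL_FROZEN together with
PD_ALL_VISIBLE, so a page can never stay marked all-frozen once it stops
being all-visible.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PD_ALL_VISIBLE	0x0004
#define PD_ALL_FROZEN	0x0008

int
main(void)
{
	uint16_t	pd_flags = 0;

	pd_flags |= PD_ALL_VISIBLE;						/* PageSetAllVisible */
	pd_flags |= PD_ALL_FROZEN;						/* PageSetAllFrozen */
	pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN);	/* PageClearAllVisible */

	/* all-frozen never survives without all-visible */
	assert((pd_flags & PD_ALL_FROZEN) == 0);
	printf("pd_flags after clear: 0x%04x\n", pd_flags);
	return 0;
}
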
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 80374e4..c6514ad 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1683,6 +1683,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(c.oid) AS n_frozen_pages,
pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid) AS last_autovacuum,
pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
@@ -1795,6 +1796,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_pages,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
@@ -1838,6 +1840,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_pages,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index f5be70f..95ababf 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -127,6 +127,8 @@ SELECT count(*) FROM tenk2 WHERE unique1 = 1;
1
(1 row)
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
SELECT pg_sleep(1.0);
@@ -175,6 +177,14 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
t | t
(1 row)
+SELECT n_frozen_pages = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+ ?column?
+----------
+ t
+(1 row)
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
snapshot_newer
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..3be0354
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,22 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 45 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 0 nonremovable row versions in 0 out of 45 pages
+\set VERBOSITY default
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index b1bc7c7..e31fa76 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -110,3 +110,6 @@ test: event_trigger
# run stats by itself because its delay may be insufficient under heavy load
test: stats
+
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index ade9ef1..666e40c 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -161,3 +161,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index cd2d592..dea5553 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -120,6 +120,8 @@ ROLLBACK;
SELECT count(*) FROM tenk2;
-- do an indexscan
SELECT count(*) FROM tenk2 WHERE unique1 = 1;
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
@@ -145,6 +147,10 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
WHERE st.relname='tenk2' AND cl.relname='tenk2';
+SELECT n_frozen_pages = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..365570b
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,16 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+\set VERBOSITY default
+
+DROP TABLE vmtest;
001_enhance_visibilitymap_debug_messages_v1.patch (application/octet-stream)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 6186caf..f4d878b 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -175,7 +175,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s, block %d", RelationGetRelationName(rel), heapBlk);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -274,7 +274,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
uint8 *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s, block %d, flags %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -364,7 +364,7 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_get_status %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_get_status %s, block %d", RelationGetRelationName(rel), heapBlk);
#endif
/* Reuse the old pinned buffer if possible */
@@ -467,7 +467,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ elog(DEBUG1, "vm_truncate %s, block %d", RelationGetRelationName(rel), nheapblocks);
#endif
RelationOpenSmgr(rel);
On Fri, Dec 18, 2015 at 3:17 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Dec 17, 2015 at 11:47 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
make check with pg_upgrade is failing for me:
Visibility map rewriting test failed
make: *** [check] Error 1
make check with pg_upgrade is done successfully on my environment.
Could you give me more information about this?
Oh, well I see now after digging into your code. You are missing -X
when running psql, and until recently psql -c implied -X all the time.
The reason why it failed for me is that I have \timing enabled in
psqlrc.
Actually test.sh needs to be fixed as well, see the attached, this is
a separate bug. Could a kind committer look at that if this is
acceptable?
Sawada-san, are you planning to continue working on that? At this
stage it seems that this patch is not in committable state and should
at best be moved to next CF, or at worst returned with feedback.
Yes, of course.
This patch should be marked as "Move to next CF".
OK, done so.
--
Michael
Attachments:
pgupgrade-fix.patch (binary/octet-stream)
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index aa7f399..7d5a594 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -156,7 +156,7 @@ standard_initdb "$oldbindir"/initdb
if "$MAKE" -C "$oldsrc" installcheck; then
pg_dumpall -f "$temp_root"/dump1.sql || pg_dumpall1_status=$?
if [ "$newsrc" != "$oldsrc" ]; then
- oldpgversion=`psql -A -t -d regression -c "SHOW server_version_num"`
+ oldpgversion=`psql -X -A -t -d regression -c "SHOW server_version_num"`
fix_sql=""
case $oldpgversion in
804??)
@@ -169,7 +169,7 @@ if "$MAKE" -C "$oldsrc" installcheck; then
fix_sql="UPDATE pg_proc SET probin = replace(probin, '$oldsrc', '$newsrc') WHERE probin LIKE '$oldsrc%';"
;;
esac
- psql -d regression -c "$fix_sql;" || psql_fix_sql_status=$?
+ psql -X -d regression -c "$fix_sql;" || psql_fix_sql_status=$?
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
I am not really getting the meaning of this sentence. Shouldn't this
be reworded something like:
"Freezing occurs on the whole table once all pages of this relation require it."
That statement isn't remotely true, and I don't think this patch
changes that. Freezing occurs on the whole table once relfrozenxid is
old enough that we think there might be at least one page in the table
that requires it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Dec 17, 2015 at 2:26 AM, Andres Freund <andres@anarazel.de> wrote:
On 2015-12-17 16:22:24 +0900, Michael Paquier wrote:
On Thu, Dec 17, 2015 at 4:10 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-12-17 15:56:35 +0900, Michael Paquier wrote:
On Thu, Dec 17, 2015 at 3:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
For me, rewriting the visibility map is a new data loss bug waiting to
happen. I am worried that the group is not taking seriously the potential
for catastrophe here.
FWIW, I'm following this line and merging the vm file into a single
unit looks like a ticking bomb.
And what are those risks?
Incorrect vm file rewrite after a pg_upgrade run.
If we can't manage to rewrite a file, replacing a binary b1 with a b10,
then we shouldn't be working on a database. And if we screw up, recovery
i is an rm *_vm away. I can't imagine that this is going to be the
actually complicated part of this feature.
Yeah. If that part of this feature isn't right, the chances that the
rest of the patch are robust enough to commit seem extremely low.
That is, as Andres says, not the hard part.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello,
At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZCCFwgKL0PmSi=htfZ2aCOZPoTPD73ecvSA9rhXa0zUw@mail.gmail.com>
On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
I am not really getting the meaning of this sentence. Shouldn't this
be reworded something like:
"Freezing occurs on the whole table once all pages of this relation require it."That statement isn't remotely true, and I don't think this patch
changes that. Freezing occurs on the whole table once relfrozenxid is
old enough that we think there might be at least one page in the table
that requires it.
I doubt I can explain this accurately, but I took the original
phrase as that if and only if all pages of the table are marked
as "requires freezing" by accident, all pages are frozen. It's
quite obvious but it is what I think "happen to require freezing"
means. Does this make sense?
The phrase might not be necessary if this is correct.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Dec 21, 2015 at 3:27 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello,
At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZCCFwgKL0PmSi=htfZ2aCOZPoTPD73ecvSA9rhXa0zUw@mail.gmail.com>
On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
I am not really getting the meaning of this sentence. Shouldn't this
be reworded something like:
"Freezing occurs on the whole table once all pages of this relation require it."That statement isn't remotely true, and I don't think this patch
changes that. Freezing occurs on the whole table once relfrozenxid is
old enough that we think there might be at least one page in the table
that requires it.I doubt I can explain this accurately, but I took the original
phrase as that if and only if all pages of the table are marked
as "requires freezing" by accident, all pages are frozen. It's
quite obvious but it is what I think "happen to require freezing"
means. Does this make sense?The phrase might not be necessary if this is correct.
Maybe you are trying to say something like "only those pages which
require freezing are frozen?".
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Dec 17, 2015 at 06:44:46AM +0000, Simon Riggs wrote:
Thank you for having a look.
I would not bother mentioning this detail in the pg_upgrade manual page:
+    Since the format of visibility map has been changed in version 9.6,
+    <application>pg_upgrade</> creates and rewrite new <literal>'_vm'</literal>
+    file even if upgrading from 9.5 or before to 9.6 or later with link
+    mode (-k).
Really?  I know we don't always document things like this, but it
seems like a good idea to me that we do so.
Agreed.
For me, rewriting the visibility map is a new data loss bug waiting to happen.
I am worried that the group is not taking seriously the potential for
catastrophe here. I think we can do it, but I think it needs these things
* Clear notice that it is happening unconditionally and unavoidably
* Log files showing it has happened, action by action
* Very clear mechanism for resolving an incomplete or interrupted upgrade
process. Which VMs got upgraded? Which didn't?
* Ability to undo an upgrade attempt, somehow, ideally automatically by default
* Ability to restart a failed upgrade attempt without doing a "double upgrade",
i.e. ensure transformation is immutable
Have you forgotten how pg_upgrade works? This new vm file is only
created on the new cluster, which is not usable if pg_upgrade doesn't
complete successfully. pg_upgrade never modifies the old cluster.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +
On Mon, Dec 21, 2015 at 11:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Dec 21, 2015 at 3:27 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:Hello,
At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZCCFwgKL0PmSi=htfZ2aCOZPoTPD73ecvSA9rhXa0zUw@mail.gmail.com>
On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
I am not really getting the meaning of this sentence. Shouldn't this
be reworded something like:
"Freezing occurs on the whole table once all pages of this relation require it."That statement isn't remotely true, and I don't think this patch
changes that. Freezing occurs on the whole table once relfrozenxid is
old enough that we think there might be at least one page in the table
that requires it.I doubt I can explain this accurately, but I took the original
phrase as that if and only if all pages of the table are marked
as "requires freezing" by accident, all pages are frozen. It's
quite obvious but it is what I think "happen to require freezing"
means. Does this make sense?The phrase might not be necessary if this is correct.
Maybe you are trying to say something like "only those pages which
require freezing are frozen?".
I was thinking the same as what Horiguchi-san said.
That is, even if relfrozenxid is old enough, freezing the whole
table is not required if the table is marked as "not requiring
freezing".
In other words, only those pages which are marked as "not frozen" are frozen.
Regards,
--
Masahiko Sawada
On Mon, Dec 28, 2015 at 6:38 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Dec 21, 2015 at 11:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Dec 21, 2015 at 3:27 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:Hello,
At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZCCFwgKL0PmSi=htfZ2aCOZPoTPD73ecvSA9rhXa0zUw@mail.gmail.com>
On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
I am not really getting the meaning of this sentence. Shouldn't this
be reworded something like:
"Freezing occurs on the whole table once all pages of this relation require it."That statement isn't remotely true, and I don't think this patch
changes that. Freezing occurs on the whole table once relfrozenxid is
old enough that we think there might be at least one page in the table
that requires it.I doubt I can explain this accurately, but I took the original
phrase as that if and only if all pages of the table are marked
as "requires freezing" by accident, all pages are frozen. It's
quite obvious but it is what I think "happen to require freezing"
means. Does this make sense?The phrase might not be necessary if this is correct.
Maybe you are trying to say something like "only those pages which
require freezing are frozen?".
I was thinking the same as what Horiguchi-san said.
That is, even if relfrozenxid is old enough, freezing the whole
table is not required if the table is marked as "not requiring
freezing".
In other words, only those pages which are marked as "not frozen" are frozen.
The recent changes to HEAD conflict with the freeze map patch, so I've
updated and attached the latest freeze map patch.
Another patch that enhances the debug log messages of the visibility map
is attached to the previous mail.
<CAD21AoBScUD4k_QWrYGRmbXVruiekPY=2BY2Fxhqq55a+tzUxg@mail.gmail.com>.
Please review it.
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v32.patch (text/x-patch)
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 001988b..5d08c73 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 392eb70..c43443a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5916,7 +5916,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -5960,7 +5960,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..7cc975d 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and which pages contain only frozen tuples.
+ This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -438,28 +438,25 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows. To ensure all old
+ row versions have been frozen, a scan of all unfrozen pages is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: whole-table freezing is forced if
+ it has not been ensured that all row versions in the table are frozen for
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transactions.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
- that, data loss could result. To ensure that this does not happen,
- autovacuum is invoked on any table that might contain unfrozen rows with
- XIDs older than the age specified by the configuration parameter <xref
- linkend="guc-autovacuum-freeze-max-age">. (This will happen even if
- autovacuum is disabled.)
+ the time <command>VACUUM</> last scanned unfrozen pages. If it were to go
+ unvacuumed for longer than that, data loss could result. To ensure
+ that this does not happen, autovacuum is invoked on any table that might
+ contain unfrozen rows with XIDs older than the age specified by the
+ configuration parameter <xref linkend="guc-autovacuum-freeze-max-age">.
+ (This will happen even if autovacuum is disabled.)
</para>
<para>
@@ -490,8 +487,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +523,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +551,18 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. Freezing occurs on the whole table once all pages of this relation
+ require it. In other cases such as where <structfield>relfrozenxid</> is more
+ than <varname>vacuum_freeze_table_age</> transactions old, where
+ <command>VACUUM</>'s <literal>FREEZE</> option is used, <command>VACUUM</>
+ can skip the pages on which all tuples are already marked as frozen.
+ When all pages of table are eventually marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transactions started since the <command>VACUUM</> started).
+ If the advancing of <structfield>relfrozenxid</> has not happened by the time
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,28 +639,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ Whenever <command>VACUUM</> scans all unfrozen pages, regardless of what
+ causes it to do so, the value for that table can be advanced.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. A
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +740,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 85459d0..0bcd52d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1423,6 +1423,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Estimated number of rows modified since this table was last analyzed</entry>
</row>
<row>
+ <entry><structfield>n_frozen_pages</></entry>
+ <entry><type>integer</></entry>
+ <entry>Number of frozen pages</entry>
+ </row>
+ <row>
<entry><structfield>last_vacuum</></entry>
<entry><type>timestamp with time zone</></entry>
<entry>Last time at which this table was manually vacuumed
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..5dc8b04 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -623,18 +623,20 @@ can be used to examine the information stored in free space maps.
<para>
Each heap relation has a Visibility Map
(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
+visible to all active transactions, and which pages contain only frozen tuples.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_vm</> suffix.
+For example, if the filenode of a relation is 12345, the VM is stored in a file
+called <filename>12345_vm</>, in the same directory as the main relation file.
Note that indexes do not have VMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The visibility map stores two bits per heap page: all-visible and all-frozen.
+A set all-visible bit means that all tuples on the page are known to be visible
+to all transactions. A set all-frozen bit means that all tuples on the page are
+completely frozen. This means that the page does not contain any tuples
+that need to be vacuumed, even if a scan of the whole table is required.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f443742..e75144f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3034,9 +3034,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -7205,7 +7205,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -7215,6 +7215,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7804,7 +7805,12 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7856,7 +7862,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index fc28f3f..7c6634a 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -15,39 +15,45 @@
* visibilitymap_pin - pin a map page for setting a bit
* visibilitymap_pin_ok - check whether correct map page is already pinned
* visibilitymap_set - set a bit in a previously pinned page
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
* visibilitymap_count - count number of bits set in visibility map
* visibilitymap_truncate - truncate the visibility map
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if a whole-table-scanning vacuum is required (e.g. anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible.
*
- * Clearing a visibility map bit is not separately WAL-logged. The callers
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
+ *
+ * Clearing both visibility map bits is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
* replay of the updating operation as well.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is all-visible or
+ * all-frozen; we just don't know that for certain. The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility map
+ * bit. If a crash occurs after the visibility map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
+ * The visibility map has the all-frozen bit which indicates all tuples on
+ * corresponding page have been completely frozen, so the visibility map is also
+ * used for anti-wraparound vacuum, even if freezing of tuples is required.
*
* LOCKING
*
@@ -58,14 +64,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,38 +107,50 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
-
-/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
-
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
/* Mapping from heap block number to the right bit in the visibility map */
#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
-#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
+
+/* tables for fast counting of set bits for visible and freeze */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
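(Aside, not part of the patch: with BITS_PER_HEAPBLOCK = 2, each map byte holds four heap blocks, with the all-visible bit in the even position and the all-frozen bit in the odd position, so the two tables above are simply per-position population counts. A minimal standalone sketch that regenerates them:)

#include <stdio.h>

int
main(void)
{
	int		b;

	for (b = 0; b < 256; b++)
	{
		int		visible = 0;
		int		frozen = 0;
		int		k;

		for (k = 0; k < 4; k++)			/* HEAPBLOCKS_PER_BYTE = 4 */
		{
			if (b & (0x01 << (2 * k)))	/* VISIBILITYMAP_ALL_VISIBLE */
				visible++;
			if (b & (0x02 << (2 * k)))	/* VISIBILITYMAP_ALL_FROZEN */
				frozen++;
		}
		/* matches number_of_ones_for_visible[b] and number_of_ones_for_frozen[b] */
		printf("%d %d\n", visible, frozen);
	}
	return 0;
}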
/* prototypes for internal routines */
@@ -141,7 +159,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,7 +171,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = VISIBILITYMAP_VALID_BITS << mapBit;
char *map;
#ifdef TRACE_VISIBILITYMAP
@@ -186,7 +204,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
* visibilitymap_set to actually set the bit.
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk.
*
@@ -212,7 +230,7 @@ visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
* visibilitymap_pin_ok - do we already have the correct page pinned?
*
* On entry, buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. The return value indicates whether the buffer covers the
* given heapBlk.
*/
@@ -225,7 +243,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +252,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and the flags indicating which bits we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,13 +264,14 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
Page page;
- char *map;
+ uint8 *map;
#ifdef TRACE_VISIBILITYMAP
elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -259,6 +279,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
Assert(InRecovery || BufferIsValid(heapBuf));
+ Assert(flags & VISIBILITYMAP_VALID_BITS);
/* Check that we have the right heap page pinned, if present */
if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -269,14 +290,14 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
elog(ERROR, "wrong VM buffer passed to visibilitymap_set");
page = BufferGetPage(vmBuf);
- map = PageGetContents(page);
+ map = (uint8 *)PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] >> mapBit & VISIBILITYMAP_VALID_BITS))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << mapBit);
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +306,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +316,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(((flags & VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
+ ((flags & VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +336,17 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible to all, or all frozen, according
+ * to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
- * earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits; the returned status
+ * can be masked with the VISIBILITYMAP_* flags the caller is interested in.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -327,17 +355,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* we might see the old value. It is the caller's responsibility to deal with
* all concurrency issues!
*/
-bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+uint8
+visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- bool result;
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_get_status %s %d", RelationGetRelationName(rel), heapBlk);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,13 +387,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * A read of the two bits is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
-
- return result;
+ return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
}
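(Aside, illustration only; the helper name below is made up. A caller consumes the returned status by masking it with the bits it cares about, which is exactly what the VM_ALL_VISIBLE / VM_ALL_FROZEN macros added to visibilitymap.h later in this patch expand to.)

/* assumes the usual postgres.h and access/visibilitymap.h includes */
static bool
page_is_all_frozen_according_to_vm(Relation rel, BlockNumber blkno, Buffer *vmbuffer)
{
	uint8		status = visibilitymap_get_status(rel, blkno, vmbuffer);

	/* the all-frozen bit is only ever set together with the all-visible bit */
	return (status & VISIBILITYMAP_ALL_FROZEN) != 0;
}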
/*
@@ -374,14 +399,21 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
+ * The caller may pass NULL for all_frozen if only the all-visible count is needed.
*/
-BlockNumber
-visibilitymap_count(Relation rel)
+void
+visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)
{
- BlockNumber result = 0;
BlockNumber mapBlock;
+ /* all_visible must be specified */
+ Assert(all_visible);
+
+ *all_visible = 0;
+ if (all_frozen)
+ *all_frozen = 0;
+
for (mapBlock = 0;; mapBlock++)
{
Buffer mapBuffer;
@@ -406,13 +438,13 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ *all_visible += number_of_ones_for_visible[map[i]];
+ if (all_frozen)
+ *all_frozen += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
}
-
- return result;
}
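(Aside, illustration only; the helper below is hypothetical. The new out-parameter signature lets callers that need both counters pass two pointers, while callers that only need the all-visible count pass NULL for all_frozen, as index_update_stats() does in the hunk that follows.)

/* hypothetical helper, assuming the usual PostgreSQL includes */
static void
report_vm_counts(Relation rel)
{
	BlockNumber relallvisible;
	BlockNumber relallfrozen;

	visibilitymap_count(rel, &relallvisible, &relallfrozen);	/* both counters */
	visibilitymap_count(rel, &relallvisible, NULL);				/* all-visible only */

	elog(DEBUG1, "allvisible=%u allfrozen=%u", relallvisible, relallfrozen);
}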
/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 062691c..f7f26cd 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1923,7 +1923,7 @@ index_update_stats(Relation rel,
BlockNumber relallvisible;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ visibilitymap_count(rel, &relallvisible, NULL);
else /* don't bother for indexes */
relallvisible = 0;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 923fe58..86437c6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -452,6 +452,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(C.oid) AS n_frozen_pages,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 070df29..d7f3035 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -314,6 +314,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ BlockNumber relallvisible,
+ relallfrozen;
if (inh)
ereport(elevel,
@@ -564,51 +566,56 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
}
}
- /*
- * Update pages/tuples stats in pg_class ... but not if we're doing
- * inherited stats.
- */
if (!inh)
+ {
+ /* Count the all-visible and all-frozen pages in the visibility map */
+ visibilitymap_count(onerel, &relallvisible, &relallfrozen);
+
+ /*
+ * Update pages/tuples stats in pg_class ... but not if we're doing
+ * inherited stats.
+ */
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ relallvisible,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
in_outer_xact);
- /*
- * Same for indexes. Vacuum always scans all indexes, so if we're part of
- * VACUUM ANALYZE, don't overwrite the accurate count already inserted by
- * VACUUM.
- */
- if (!inh && !(options & VACOPT_VACUUM))
- {
- for (ind = 0; ind < nindexes; ind++)
+ /*
+ * Same for indexes. Vacuum always scans all indexes, so if we're part of
+ * VACUUM ANALYZE, don't overwrite the accurate count already inserted by
+ * VACUUM.
+ */
+ if (!(options & VACOPT_VACUUM))
{
- AnlIndexData *thisdata = &indexdata[ind];
- double totalindexrows;
-
- totalindexrows = ceil(thisdata->tupleFract * totalrows);
- vac_update_relstats(Irel[ind],
- RelationGetNumberOfBlocks(Irel[ind]),
- totalindexrows,
- 0,
- false,
- InvalidTransactionId,
- InvalidMultiXactId,
- in_outer_xact);
+ for (ind = 0; ind < nindexes; ind++)
+ {
+ AnlIndexData *thisdata = &indexdata[ind];
+ double totalindexrows;
+
+ totalindexrows = ceil(thisdata->tupleFract * totalrows);
+ vac_update_relstats(Irel[ind],
+ RelationGetNumberOfBlocks(Irel[ind]),
+ totalindexrows,
+ 0,
+ false,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ in_outer_xact);
+ }
}
- }
- /*
- * Report ANALYZE to the stats collector, too. However, if doing
- * inherited stats we shouldn't report, because the stats collector only
- * tracks per-table stats.
- */
- if (!inh)
- pgstat_report_analyze(onerel, totalrows, totaldeadrows);
+ /*
+ * Report ANALYZE to the stats collector, too. However, if doing
+ * inherited stats we shouldn't report, because the stats collector only
+ * tracks per-table stats.
+ */
+ pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
+
+ }
/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
if (!(options & VACOPT_VACUUM))
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 4f6f6e7..06eadbc 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped due to the
+ all-frozen bit of the visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,8 +158,9 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
ItemPointer itemptr);
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
-static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+static void heap_page_visible_status(Relation rel, Buffer buf,
+ TransactionId *visibility_cutoff_xid,
+ bool *all_visible, bool *all_frozen);
/*
@@ -188,7 +191,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -221,7 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* We request a full scan if either the table's frozen Xid is now older
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit.
+ * mxid full-table scan limit. Even during a full scan, we can skip pages
+ * whose all-frozen bit is set in the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +258,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -295,10 +301,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ visibilitymap_count(onerel, &new_rel_allvisible, &new_rel_allfrozen);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -319,7 +328,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),
onerel->rd_rel->relisshared,
new_live_tuples,
- vacrelstats->new_dead_tuples);
+ vacrelstats->new_dead_tuples,
+ new_rel_allfrozen);
/* and log the action if appropriate */
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
@@ -354,10 +364,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -480,9 +491,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * the visibility map means that we can't update relfrozenxid, so we only
+ * want to do it if we can skip a goodly number of pages. On the other hand,
+ * we count both how many pages we skipped according to the all-frozen bit
+ * of the visibility map and how many pages we froze, so we can still update
+ * relfrozenxid when the sum of the two covers every page of the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -492,18 +506,18 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we skip only pages marked all-frozen; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
- * Note: The value returned by visibilitymap_test could be slightly
+ * Note: The value returned by visibilitymap_get_status could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*
* We will scan the table's last page, at least to the extent of
* determining whether it has tuples or not, even if it should be skipped
@@ -518,7 +532,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -536,9 +550,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples we freeze here */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on the page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -554,8 +572,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -569,14 +586,29 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
skipping_all_visible_blocks = true;
else
skipping_all_visible_blocks = false;
+
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all && !FORCE_CHECK_PAGE())
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We check whether this block is all-frozen or not, so that we can
+ * skip vacuuming this page even when scan_all is true.
+ */
+ bool all_frozen = VM_ALL_FROZEN(onerel, blkno, &vmbuffer);
+
+ if (scan_all && all_frozen && !FORCE_CHECK_PAGE())
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ else if (!scan_all && skipping_all_visible_blocks && !FORCE_CHECK_PAGE())
+ continue;
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -743,7 +775,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
empty_pages++;
freespace = PageGetHeapFreeSpace(page);
- /* empty pages are always all-visible */
+ /* empty pages are always all-visible and all-frozen */
if (!PageIsAllVisible(page))
{
START_CRIT_SECTION();
@@ -766,8 +798,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
log_newpage_buffer(buf, true);
PageSetAllVisible(page);
+ PageSetAllFrozen(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
END_CRIT_SECTION();
}
@@ -791,13 +825,15 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
/*
* Note: If you change anything in the loop below, also look at
- * heap_page_is_all_visible to see if that needs to be changed.
+ * heap_page_visible_status to see if that needs to be changed.
*/
for (offnum = FirstOffsetNumber;
offnum <= maxoff;
@@ -945,8 +981,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -993,6 +1034,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute the number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -1015,26 +1059,46 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1045,9 +1109,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
{
- elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+ /* If the all-frozen bit is set, the all-visible bit must be set too */
+ if (all_frozen_according_to_vm)
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page is not marked all-visible (and all-frozen) but visibility map bit(s) is set in relation \"%s\" page %u",
relname, blkno);
visibilitymap_clear(onerel, blkno, vmbuffer);
}
@@ -1055,19 +1124,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/*
* It's possible for the value returned by GetOldestXmin() to move
* backwards, so it's not wrong for us to see tuples that appear to
- * not be visible to everyone yet, while PD_ALL_VISIBLE is already
- * set. The real safe xmin value never moves backwards, but
- * GetOldestXmin() is conservative and sometimes returns a value
+ * not be visible to everyone yet, while PD_ALL_VISIBLE (and PD_ALL_FROZEN)
+ * are already set. The real safe xmin value never moves backwards,
+ * but GetOldestXmin() is conservative and sometimes returns a value
* that's unnecessarily small, so if we see that contradiction it just
* means that the tuples that we think are not visible to everyone yet
- * actually are, and the PD_ALL_VISIBLE flag is correct.
+ * actually are, and the PD_ALL_VISIBLE (and PD_ALL_FROZEN) flags are
+ * correct.
*
* There should never be dead tuples on a page with PD_ALL_VISIBLE
* set, however.
*/
else if (PageIsAllVisible(page) && has_dead_tuples)
{
- elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+ /* If the all-frozen bit is set, the all-visible bit must be set too */
+ if (PageIsAllFrozen(page))
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page containing dead tuples is marked as all-visible (and all-frozen) in relation \"%s\" page %u",
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
@@ -1141,6 +1216,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to visibility map",
+ "skipped %d frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages,
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1257,6 +1339,8 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_visible;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1308,19 +1392,36 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ heap_page_visible_status(onerel, buffer, &visibility_cutoff_xid,
+ &all_visible, &all_frozen);
+ if (all_visible)
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set the VM all-frozen bit.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
+ uint8 vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer);
+ uint8 flags = 0;
+
+ if (!(vm_status & VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* Add the VM all-frozen bit to the flags, if needed */
+ if (all_frozen && !(vm_status & VISIBILITYMAP_ALL_FROZEN))
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+
+ if (vm_status != flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1842,18 +1943,21 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and set all_frozen to indicate whether all
+ * tuples on this page are frozen.
*/
-static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+static void
+heap_page_visible_status(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_visible, bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
OffsetNumber offnum,
maxoff;
- bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_visible = true;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1861,7 +1965,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
*/
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = FirstOffsetNumber;
- offnum <= maxoff && all_visible;
+ offnum <= maxoff && *all_visible;
offnum = OffsetNumberNext(offnum))
{
ItemId itemid;
@@ -1877,11 +1981,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible or frozen.
*/
if (ItemIdIsDead(itemid))
{
- all_visible = false;
+ *all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1900,7 +2005,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Check comments in lazy_scan_heap. */
if (!HeapTupleHeaderXminCommitted(tuple.t_data))
{
- all_visible = false;
+ *all_visible = false;
break;
}
@@ -1911,13 +2016,17 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
xmin = HeapTupleHeaderGetXmin(tuple.t_data);
if (!TransactionIdPrecedes(xmin, OldestXmin))
{
- all_visible = false;
+ *all_visible = false;
break;
}
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1925,7 +2034,8 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_RECENTLY_DEAD:
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
- all_visible = false;
+ *all_visible = false;
+ *all_frozen = false;
break;
default:
@@ -1934,5 +2044,6 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
} /* scan along page */
- return all_visible;
+ if (!(*all_visible))
+ *all_frozen = false;
}
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 90afbdc..13e2d76 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -85,9 +85,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
* which all tuples are known visible to everybody. In any case,
* we'll use the index tuple not the heap tuple as the data source.
*
- * Note on Memory Ordering Effects: visibilitymap_test does not lock
- * the visibility map buffer, and therefore the result we read here
- * could be slightly stale. However, it can't be stale enough to
+ * Note on Memory Ordering Effects: visibilitymap_get_status does not
+ * lock the visibility map buffer, and therefore the result we read
+ * here could be slightly stale. However, it can't be stale enough to
* matter.
*
* We need to detect clearing a VM bit due to an insert right away,
@@ -114,9 +114,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
* It's worth going through this complexity to avoid needing to lock
* the VM buffer, which could cause significant contention.
*/
- if (!visibilitymap_test(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ if (!VM_ALL_VISIBLE(scandesc->heapRelation,
+ ItemPointerGetBlockNumber(tid),
+ &node->ioss_VMBuffer))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index da768c6..08b61cb 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1329,7 +1329,8 @@ pgstat_report_autovac(Oid dboid)
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgVacuum msg;
@@ -1343,6 +1344,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -1354,7 +1356,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
*/
void
pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgAnalyze msg;
@@ -1394,6 +1397,7 @@ pgstat_report_analyze(Relation rel,
msg.m_analyzetime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -3702,6 +3706,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
+ result->n_frozen_pages = 0;
result->blocks_fetched = 0;
result->blocks_hit = 0;
result->vacuum_timestamp = 0;
@@ -5069,6 +5074,7 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
if (msg->m_autovacuum)
{
@@ -5103,6 +5109,7 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
/*
* We reset changes_since_analyze to zero, forgetting any changes that
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 1b22fcc..7c57b3e 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -46,6 +46,7 @@ extern Datum pg_stat_get_vacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autovacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_analyze_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_frozen_pages(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_calls(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_total_time(PG_FUNCTION_ARGS);
@@ -450,6 +451,21 @@ pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_frozen_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int32 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int32) (tabentry->n_frozen_pages);
+
+ PG_RETURN_INT32(result);
+}
+
+Datum
pg_stat_get_function_calls(PG_FUNCTION_ARGS)
{
Oid funcid = PG_GETARG_OID(0);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 9357ad8..ce55541 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -9,7 +9,11 @@
#include "postgres_fe.h"
+#include "access/visibilitymap.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
#include <fcntl.h>
@@ -21,6 +25,43 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static const char *rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool force);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
+
+/*
+ * copyOrRewriteFile()
+ * This function copies a relation file or rewrites a visibility map file.
+ * If rewrite_vm is true, we rewrite the visibility map regardless of the value of pageConverter.
+ */
+const char *
+copyOrRewriteFile(pageCnvCtx *pageConverter,
+ const char *src, const char *dst, bool force, bool rewrite_vm)
+{
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, force);
+ else
+ return copyAndUpdateFile(pageConverter, src, dst, force);
+}
/*
* copyAndUpdateFile()
@@ -115,12 +156,14 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
*/
const char *
linkAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst)
+ const char *src, const char *dst, bool rewrite_vm)
{
if (pageConverter != NULL)
return "Cannot in-place update this cluster, page-by-page conversion is required";
- if (pg_link_file(src, dst) == -1)
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, true);
+ else if (pg_link_file(src, dst) == -1)
return getErrorText();
else
return NULL;
@@ -205,6 +248,96 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilitymap()
+ *
+ * Copies a visibility map file, inserting a cleared all-frozen bit after each existing all-visible bit.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+ int src_fd = 0;
+ int dst_fd = 0;
+ char buffer[BLCKSZ];
+ ssize_t bytesRead;
+ int rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+ BlockNumber blkno = 0;
+
+ /* Reset errno */
+ errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return getErrorText();
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ /* Perform data rewriting per page */
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
+ char *cur, *end, *blkend;
+ PageHeaderData pageheader;
+ uint16 vm_bits;
+
+ /* Save the page header data */
+ memcpy(&pageheader, buffer, SizeOfPageHeaderData);
+
+ cur = buffer + SizeOfPageHeaderData;
+ end = buffer + SizeOfPageHeaderData + rewriteVmBytesPerPage;
+ blkend = buffer + bytesRead;
+
+ while (blkend >= end)
+ {
+ char vmbuf[BLCKSZ];
+ char *vmtmp = vmbuf;
+
+ /* Copy page header in advance */
+ memcpy(vmbuf, &pageheader, SizeOfPageHeaderData);
+
+ vmtmp += SizeOfPageHeaderData;
+
+ /* Rewrite visibility map bit one by one */
+ while (end > cur)
+ {
+ /* Expand each old byte into two new bytes using the translation table */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+ memcpy(vmtmp, &vm_bits, BITS_PER_HEAPBLOCK);
+
+ cur++;
+ vmtmp += BITS_PER_HEAPBLOCK;
+ }
+
+ /* Set a new checksum for the visibility map page, if enabled */
+ if (old_cluster.controldata.data_checksum_version != 0 &&
+ new_cluster.controldata.data_checksum_version != 0)
+ ((PageHeader) vmbuf)->pd_checksum = pg_checksum_page(vmbuf, blkno);
+
+ if (write(dst_fd, vmbuf, BLCKSZ) != BLCKSZ)
+ {
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ end += rewriteVmBytesPerPage;
+ blkno++;
+ }
+ }
+
+err:
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText();
+}
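(Aside, a worked example under the assumed default build settings of BLCKSZ = 8192 and a 24-byte page header: each old map byte covers 8 heap blocks at 1 bit and expands to 2 bytes at 2 bits, so rewriteVmBytesPerPage is 4084 and every old visibility map page is rewritten into exactly two new pages.)

#include <assert.h>

int
main(void)
{
	const int blcksz = 8192;			/* assumed default BLCKSZ */
	const int page_header = 24;			/* assumed SizeOfPageHeaderData */
	const int old_payload = blcksz - page_header;			/* 8168 old map bytes per page */
	const int old_bytes_per_new_page = old_payload / 2;	/* 4084, i.e. rewriteVmBytesPerPage */

	/* each old byte expands to two bytes, exactly filling one new page's payload */
	assert(old_bytes_per_new_page * 2 == blcksz - page_header);
	return 0;
}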
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index bc733c4..faa690e 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -110,6 +110,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The visibility map format was changed (all-frozen bit added) with this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201601121
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -394,10 +398,12 @@ const pageCnvCtx *setupPageConverter(void);
typedef void *pageCnvCtx;
#endif
+const char *copyOrRewriteFile(pageCnvCtx *pageConverter, const char *src,
+ const char *dst, bool force, bool rewrite_vm);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst);
+ const char *dst, bool rewrite_vm);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c059c5b..e02a931 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *suffix, bool vm_need_rewrite);
/*
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_need_rewrite = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,13 +181,20 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_need_rewrite = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(pageConverter, &maps[mapnum], "", false);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +202,14 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(pageConverter, &maps[mapnum], "_fsm", false);
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ {
+ if (vm_need_rewrite)
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", true);
+ else
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", false);
+ }
}
}
}
@@ -210,7 +223,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
*/
static void
transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+ const char *type_suffix, bool vm_need_rewrite)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -276,7 +289,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ if ((msg = copyOrRewriteFile(pageConverter, old_file, new_file, true, vm_need_rewrite)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -284,7 +297,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
+ if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file, vm_need_rewrite)) != NULL)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index ba79fb3..cd9b17e 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index f77489b..5fcb539 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index d447daf..a75de5c 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,36 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
+
+/* Number of heap blocks we can represent in one byte. */
+#define HEAPBLOCKS_PER_BYTE 4
+
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+#define VISIBILITYMAP_VALID_BITS 0x03 /* OR of all valid visibility map flag bits */
+
+/* Macros for visibilitymap test */
+#define VM_ALL_VISIBLE(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_VISIBLE) != 0)
+#define VM_ALL_FROZEN(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 62e08a9..8d8ffea 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201601091
+#define CATALOG_VERSION_NO 201601121
#endif
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f58672e..7e05dc1 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2781,6 +2781,8 @@ DATA(insert OID = 3056 ( pg_stat_get_analyze_count PGNSP PGUID 12 1 0 0 0 f f f
DESCR("statistics: number of manual analyzes for a table");
DATA(insert OID = 3057 ( pg_stat_get_autoanalyze_count PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_autoanalyze_count _null_ _null_ _null_ ));
DESCR("statistics: number of auto analyzes for a table");
+DATA(insert OID = 6015 ( pg_stat_get_frozen_pages PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_frozen_pages _null_ _null_ _null_ ));
+DESCR("statistics: number of frozen pages of table");
DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 0 f f f f t t s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_idset _null_ _null_ _null_ ));
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f f f t s r 1 0 2249 "23" "{23,26,23,26,25,25,25,16,1184,1184,1184,1184,869,25,23,28,28,16,25,25,23,16,25}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,pid,usesysid,application_name,state,query,waiting,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,ssl,sslversion,sslcipher,sslbits,sslcompression,sslclientdn}" _null_ _null_ pg_stat_get_activity _null_ _null_ _null_ ));
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 65e968e..ad40b70 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -17,6 +17,7 @@
#include "portability/instr_time.h"
#include "postmaster/pgarch.h"
#include "storage/barrier.h"
+#include "storage/block.h"
#include "utils/hsearch.h"
#include "utils/relcache.h"
@@ -355,6 +356,7 @@ typedef struct PgStat_MsgVacuum
TimestampTz m_vacuumtime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ BlockNumber m_frozen_pages;
} PgStat_MsgVacuum;
@@ -372,6 +374,7 @@ typedef struct PgStat_MsgAnalyze
TimestampTz m_analyzetime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgAnalyze;
@@ -551,7 +554,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9E
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -615,6 +618,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter n_dead_tuples;
PgStat_Counter changes_since_analyze;
+ int32 n_frozen_pages;
+
PgStat_Counter blocks_fetched;
PgStat_Counter blocks_hit;
@@ -917,9 +922,11 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type t
extern void pgstat_report_autovac(Oid dboid);
extern void pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_recovery_conflict(int reason);
extern void pgstat_report_deadlock(void);
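(One remark on the hunk above, my reading rather than patch text: the PGSTAT_FILE_FORMAT_ID bump is needed because PgStat_StatTabEntry gains the n_frozen_pages field, which changes the layout of the saved statistics file; bumping the ID makes the collector discard an old-format file instead of misreading it.)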
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 2ce3be7..0b023b3 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -367,7 +369,12 @@ typedef PageHeaderData *PageHeader;
#define PageSetAllVisible(page) \
(((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
#define PageClearAllVisible(page) \
- (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+ (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
+
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
#define PageIsPrunable(page, oldestxmin) \
( \
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 28b061f..c95c788 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1683,6 +1683,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(c.oid) AS n_frozen_pages,
pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid) AS last_autovacuum,
pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
@@ -1795,6 +1796,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_pages,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
@@ -1838,6 +1840,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_pages,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index f5be70f..95ababf 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -127,6 +127,8 @@ SELECT count(*) FROM tenk2 WHERE unique1 = 1;
1
(1 row)
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
SELECT pg_sleep(1.0);
@@ -175,6 +177,14 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
t | t
(1 row)
+SELECT n_frozen_pages = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+ ?column?
+----------
+ t
+(1 row)
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
snapshot_newer
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..87206b6
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,22 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 44 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 56 nonremovable row versions in 1 out of 45 pages
+\set VERBOSITY default
+DROP TABLE vmtest;
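For what it's worth, the numbers in that expected output hang together (rough arithmetic, assuming the default 8kB block size): 10000 single-int rows pack at about 226 tuples per page, giving the 45 heap pages reported. After the first VACUUM FREEZE has marked everything all-frozen, the second run skips the 44 fully-frozen pages and only looks at the final page, which vacuum always checks, hence "found 0 removable, 56 nonremovable row versions in 1 out of 45 pages" with 56 = 10000 - 44 * 226.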
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index b1bc7c7..e31fa76 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -110,3 +110,6 @@ test: event_trigger
# run stats by itself because its delay may be insufficient under heavy load
test: stats
+
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index ade9ef1..666e40c 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -161,3 +161,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index cd2d592..dea5553 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -120,6 +120,8 @@ ROLLBACK;
SELECT count(*) FROM tenk2;
-- do an indexscan
SELECT count(*) FROM tenk2 WHERE unique1 = 1;
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
@@ -145,6 +147,10 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
WHERE st.relname='tenk2' AND cl.relname='tenk2';
+SELECT n_frozen_pages = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..365570b
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,16 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+\set VERBOSITY default
+
+DROP TABLE vmtest;
On Wed, Jan 13, 2016 at 12:16 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Dec 28, 2015 at 6:38 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Dec 21, 2015 at 11:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Dec 21, 2015 at 3:27 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello,
At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZCCFwgKL0PmSi=htfZ2aCOZPoTPD73ecvSA9rhXa0zUw@mail.gmail.com>
On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
I am not really getting the meaning of this sentence. Shouldn't this
be reworded something like:
"Freezing occurs on the whole table once all pages of this relation require it."
That statement isn't remotely true, and I don't think this patch
changes that. Freezing occurs on the whole table once relfrozenxid is
old enough that we think there might be at least one page in the table
that requires it.
I doubt I can explain this accurately, but I took the original
phrase as that if and only if all pages of the table are marked
as "requires freezing" by accident, all pages are frozen. It's
quite obvious but it is what I think "happen to require freezing"
means. Does this make sense?
The phrase might not be necessary if this is correct.
Maybe you are trying to say something like "only those pages which
require freezing are frozen?".
I was thinking the same as what Horiguchi-san said.
That is, even if relfrozenxid is old enough, freezing of the whole
table is not required if the table is marked as "does not require
freezing".
In other words, only those pages which are marked as "not frozen" are frozen.
The recent changes to HEAD conflict with the freeze map patch, so I've
updated and attached the latest freeze map patch.
Another patch, which enhances the debug log message of the visibility map,
is attached to the previous mail
</messages/by-id/CAD21AoBScUD4k_QWrYGRmbXVruiekPY=2BY2Fxhqq55a+tzUxg@mail.gmail.com>.
Please review it.
Attached updated version patch.
Please review it.
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v33.patch (binary/octet-stream)
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 001988b..5d08c73 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 392eb70..c43443a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5916,7 +5916,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -5960,7 +5960,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..7cc975d 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and of which pages contain only frozen tuples.
+ This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -438,28 +438,25 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows. To ensure all old
+ row versions have been frozen, a scan of all unfrozen pages is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: whole-table freezing is forced if
+ the table hasn't had all of its row versions frozen for
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transactions.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
- that, data loss could result. To ensure that this does not happen,
- autovacuum is invoked on any table that might contain unfrozen rows with
- XIDs older than the age specified by the configuration parameter <xref
- linkend="guc-autovacuum-freeze-max-age">. (This will happen even if
- autovacuum is disabled.)
+ the time <command>VACUUM</> last scanned unfrozen pages. If it were to go
+ unvacuumed for longer than that, data loss could result. To ensure
+ that this does not happen, autovacuum is invoked on any table that might
+ contain unfrozen rows with XIDs older than the age specified by the
+ configuration parameter <xref linkend="guc-autovacuum-freeze-max-age">.
+ (This will happen even if autovacuum is disabled.)
</para>
<para>
@@ -490,8 +487,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +523,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +551,18 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. Freezing occurs on the whole table once all pages of this relation
+ require it. In cases such as when <structfield>relfrozenxid</> is more
+ than <varname>vacuum_freeze_table_age</> transactions old, or when
+ <command>VACUUM</>'s <literal>FREEZE</> option is used, <command>VACUUM</>
+ can skip pages on which all tuples are already marked as frozen.
+ When all pages of the table are eventually marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transactions started since the <command>VACUUM</> started).
+ If <structfield>relfrozenxid</> has not been advanced by the time
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,28 +639,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ <command>VACUUM</> scans of all unfrozen pages, regardless of what causes
+ them, enable advancing the value for that table.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Such
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +740,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
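To put concrete numbers on the thresholds discussed above (my arithmetic, using the defaults of vacuum_freeze_min_age = 50 million and vacuum_freeze_table_age = 150 million): a freezing pass over the table is forced roughly every 150 - 50 = 100 million transactions after the previous one. With this patch, that pass can additionally skip any page whose all-frozen bit is still set, so on a mostly-static table it only touches the pages written since the last pass.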
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 85459d0..0bcd52d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1423,6 +1423,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Estimated number of rows modified since this table was last analyzed</entry>
</row>
<row>
+ <entry><structfield>n_frozen_pages</></entry>
+ <entry><type>integer</></entry>
+ <entry>Number of frozen pages</entry>
+ </row>
+ <row>
<entry><structfield>last_vacuum</></entry>
<entry><type>timestamp with time zone</></entry>
<entry>Last time at which this table was manually vacuumed
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..5dc8b04 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -623,18 +623,20 @@ can be used to examine the information stored in free space maps.
<para>
Each heap relation has a Visibility Map
(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
+visible to all active transactions, and of which pages contain only frozen tuples.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_vm</> suffix.
+For example, if the filenode of a relation is 12345, the VM is stored in a file
+called <filename>12345_vm</>, in the same directory as the main relation file.
Note that indexes do not have VMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The visibility map stores two bits per heap page: all-visible and all-frozen.
+A set all-visible bit means that all tuples on the page are known to be visible
+to all transactions. A set all-frozen bit means that all tuples on the page are
+completely frozen. This means that the page does not contain any tuples
+that need to be vacuumed, even if a whole-table scan is required.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
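A quick size check on the new format (my arithmetic, assuming the default 8kB block size and the usual 24-byte page header): a map page has about 8192 - 24 = 8168 usable bytes, and at two bits per heap page that covers 8168 * 4 = 32672 heap pages, i.e. roughly 255MB of heap per visibility map page. The map therefore doubles in size compared to the old one-bit format, but it is still only about 1/32768 of the heap it describes.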
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f443742..e75144f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3034,9 +3034,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -7205,7 +7205,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -7215,6 +7215,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7804,7 +7805,12 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7856,7 +7862,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index fc28f3f..6d95c7f 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -15,39 +15,45 @@
* visibilitymap_pin - pin a map page for setting a bit
* visibilitymap_pin_ok - check whether correct map page is already pinned
* visibilitymap_set - set a bit in a previously pinned page
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
* visibilitymap_count - count number of bits set in visibility map
* visibilitymap_truncate - truncate the visibility map
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if a whole-table-scanning vacuum is required (e.g. an anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible.
*
- * Clearing a visibility map bit is not separately WAL-logged. The callers
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
+ *
+ * Clearing both visibility map bits is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
* replay of the updating operation as well.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is all-visible or
+ * all-frozen; we just don't know that for certain. The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility map
+ * bit. If a crash occurs after the visibility map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
+ * The visibility map has the all-frozen bit, which indicates that all tuples on
+ * the corresponding page have been completely frozen, so the visibility map is also
+ * used for anti-wraparound vacuum, even if freezing of tuples is required.
*
* LOCKING
*
@@ -58,14 +64,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,38 +107,50 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
-
-/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
-
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
/* Mapping from heap block number to the right bit in the visibility map */
#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
-#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
+
+/* tables for fast counting of set bits for visible and frozen */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +159,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,7 +171,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = VISIBILITYMAP_VALID_BITS << mapBit;
char *map;
#ifdef TRACE_VISIBILITYMAP
@@ -186,7 +204,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
* visibilitymap_set to actually set the bit.
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk.
*
@@ -212,7 +230,7 @@ visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
* visibilitymap_pin_ok - do we already have the correct page pinned?
*
* On entry, buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. The return value indicates whether the buffer covers the
* given heapBlk.
*/
@@ -225,7 +243,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +252,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and the flags indicating which bits we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,13 +264,14 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
Page page;
- char *map;
+ uint8 *map;
#ifdef TRACE_VISIBILITYMAP
elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -259,6 +279,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
Assert(InRecovery || BufferIsValid(heapBuf));
+ Assert(flags & VISIBILITYMAP_VALID_BITS);
/* Check that we have the right heap page pinned, if present */
if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -269,14 +290,14 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
elog(ERROR, "wrong VM buffer passed to visibilitymap_set");
page = BufferGetPage(vmBuf);
- map = PageGetContents(page);
+ map = (uint8 *)PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] >> mapBit & VISIBILITYMAP_VALID_BITS))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << mapBit);
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +306,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +316,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(((flags & VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
+ ((flags & VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +336,17 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible to all or are marked frozen, according
+ * to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
- * earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits. The caller masks
+ * the returned status with the flags it is interested in.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -327,17 +355,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* we might see the old value. It is the caller's responsibility to deal with
* all concurrency issues!
*/
-bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+uint8
+visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- bool result;
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_get_status %s %d", RelationGetRelationName(rel), heapBlk);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,13 +387,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * Reading the two bits is atomic (they are in the same byte). There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
-
- return result;
+ return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
}
/*
@@ -374,14 +399,20 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
*/
-BlockNumber
-visibilitymap_count(Relation rel)
+void
+visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)
{
- BlockNumber result = 0;
BlockNumber mapBlock;
+ /* all_visible must be specified */
+ Assert(all_visible);
+
+ *all_visible = 0;
+ if (all_frozen)
+ *all_frozen = 0;
+
for (mapBlock = 0;; mapBlock++)
{
Buffer mapBuffer;
@@ -406,13 +437,13 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ *all_visible += number_of_ones_for_visible[map[i]];
+ if (all_frozen)
+ *all_frozen += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
}
-
- return result;
}
/*
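To make the two new lookup tables above easier to audit, here is a tiny standalone check (not part of the patch) that recomputes what they encode: for one map byte holding four two-bit entries, number_of_ones_for_visible[b] is the count of entries with the 0x01 bit set and number_of_ones_for_frozen[b] is the count with the 0x02 bit set.

#include <stdint.h>
#include <stdio.h>

/*
 * For one visibility-map byte (four two-bit entries), count how many
 * entries have the given flag (0x01 = all-visible, 0x02 = all-frozen) set.
 * number_of_ones_for_visible[] and number_of_ones_for_frozen[] in the
 * patch are this function precomputed for all 256 byte values.
 */
static int
count_flag(uint8_t mapbyte, uint8_t flag)
{
	int			n = 0;
	int			slot;

	for (slot = 0; slot < 4; slot++)
		if ((mapbyte >> (slot * 2)) & flag)
			n++;
	return n;
}

int
main(void)
{
	/* 0x07 = binary 00 00 01 11: block 0 all-visible and all-frozen,
	 * block 1 all-visible only */
	uint8_t		b = 0x07;

	printf("visible=%d frozen=%d\n", count_flag(b, 0x01), count_flag(b, 0x02));
	return 0;
}

This prints "visible=2 frozen=1", matching the tables' entries at index 7.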
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 062691c..f7f26cd 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1923,7 +1923,7 @@ index_update_stats(Relation rel,
BlockNumber relallvisible;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ visibilitymap_count(rel, &relallvisible, NULL);
else /* don't bother for indexes */
relallvisible = 0;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 923fe58..86437c6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -452,6 +452,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(C.oid) AS n_frozen_pages,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 070df29..d7f3035 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -314,6 +314,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ BlockNumber relallvisible,
+ relallfrozen;
if (inh)
ereport(elevel,
@@ -564,51 +566,56 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
}
}
- /*
- * Update pages/tuples stats in pg_class ... but not if we're doing
- * inherited stats.
- */
if (!inh)
+ {
+ /* Count the number of all-visible and all-frozen pages */
+ visibilitymap_count(onerel, &relallvisible, &relallfrozen);
+
+ /*
+ * Update pages/tuples stats in pg_class ... but not if we're doing
+ * inherited stats.
+ */
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ relallvisible,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
in_outer_xact);
- /*
- * Same for indexes. Vacuum always scans all indexes, so if we're part of
- * VACUUM ANALYZE, don't overwrite the accurate count already inserted by
- * VACUUM.
- */
- if (!inh && !(options & VACOPT_VACUUM))
- {
- for (ind = 0; ind < nindexes; ind++)
+ /*
+ * Same for indexes. Vacuum always scans all indexes, so if we're part of
+ * VACUUM ANALYZE, don't overwrite the accurate count already inserted by
+ * VACUUM.
+ */
+ if (!(options & VACOPT_VACUUM))
{
- AnlIndexData *thisdata = &indexdata[ind];
- double totalindexrows;
-
- totalindexrows = ceil(thisdata->tupleFract * totalrows);
- vac_update_relstats(Irel[ind],
- RelationGetNumberOfBlocks(Irel[ind]),
- totalindexrows,
- 0,
- false,
- InvalidTransactionId,
- InvalidMultiXactId,
- in_outer_xact);
+ for (ind = 0; ind < nindexes; ind++)
+ {
+ AnlIndexData *thisdata = &indexdata[ind];
+ double totalindexrows;
+
+ totalindexrows = ceil(thisdata->tupleFract * totalrows);
+ vac_update_relstats(Irel[ind],
+ RelationGetNumberOfBlocks(Irel[ind]),
+ totalindexrows,
+ 0,
+ false,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ in_outer_xact);
+ }
}
- }
- /*
- * Report ANALYZE to the stats collector, too. However, if doing
- * inherited stats we shouldn't report, because the stats collector only
- * tracks per-table stats.
- */
- if (!inh)
- pgstat_report_analyze(onerel, totalrows, totaldeadrows);
+ /*
+ * Report ANALYZE to the stats collector, too. However, if doing
+ * inherited stats we shouldn't report, because the stats collector only
+ * tracks per-table stats.
+ */
+ pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
+
+ }
/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
if (!(options & VACOPT_VACUUM))
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 4f6f6e7..fbdb18c 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped due to
+ the all-frozen bit of the visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,8 +158,9 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
ItemPointer itemptr);
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
-static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+static void heap_page_visible_status(Relation rel, Buffer buf,
+ TransactionId *visibility_cutoff_xid,
+ bool *all_visible, bool *all_frozen);
/*
@@ -188,7 +191,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -221,7 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* We request a full scan if either the table's frozen Xid is now older
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit.
+ * mxid full-table scan limit. Even during a full scan, we can skip pages
+ * according to the all-frozen bit of the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +258,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -274,15 +280,15 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* Update statistics in pg_class.
*
* A corner case here is that if we scanned no pages at all because every
- * page is all-visible, we should not update relpages/reltuples, because
- * we have no new information to contribute. In particular this keeps us
- * from replacing relpages=reltuples=0 (which means "unknown tuple
+ * page is all-visible or all-frozen, we should not update relpages/reltuples,
+ * because we have no new information to contribute. In particular this keeps
+ * us from replacing relpages=reltuples=0 (which means "unknown tuple
* density") with nonzero relpages and reltuples=0 (which means "zero
* tuple density") unless there's some actual evidence for the latter.
*
- * We do update relallvisible even in the corner case, since if the table
- * is all-visible we'd definitely like to know that. But clamp the value
- * to be not more than what we're setting relpages to.
+ * We do update relallvisible and relallfrozen even in the corner case,
+ * since if the table is all-visible we'd definitely like to know that.
+ * But clamp the value to be not more than what we're setting relpages to.
*
* Also, don't change relfrozenxid/relminmxid if we skipped any pages,
* since then we don't know for certain that all tuples have a newer xmin.
@@ -295,10 +301,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ visibilitymap_count(onerel, &new_rel_allvisible, &new_rel_allfrozen);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -319,7 +328,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),
onerel->rd_rel->relisshared,
new_live_tuples,
- vacrelstats->new_dead_tuples);
+ vacrelstats->new_dead_tuples,
+ new_rel_allfrozen);
/* and log the action if appropriate */
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
@@ -354,10 +364,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -480,9 +491,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * the visibility map means that we might not be able to update relfrozenxid,
+ * so we only want to do it if we can skip a goodly number. On the other hand,
+ * we count both how many pages we skipped according to the all-frozen bit of
+ * the visibility map and how many pages we froze, so we can update relfrozenxid
+ * if the sum of the two equals the number of pages in the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -492,18 +506,18 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we might not actually skip any pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
- * Note: The value returned by visibilitymap_test could be slightly
+ * Note: The value returned by visibilitymap_get_status could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*
* We will scan the table's last page, at least to the extent of
* determining whether it has tuples or not, even if it should be skipped
@@ -518,7 +532,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -536,9 +550,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples we freeze on this page */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -554,8 +572,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -569,14 +586,29 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
skipping_all_visible_blocks = true;
else
skipping_all_visible_blocks = false;
+
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all && !FORCE_CHECK_PAGE())
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We check whether this block is all-frozen as well, so that we can
+ * skip vacuuming this page even if scan_all is true.
+ */
+ bool all_frozen = VM_ALL_FROZEN(onerel, blkno, &vmbuffer);
+
+ if (scan_all && all_frozen && !FORCE_CHECK_PAGE())
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ else if (!scan_all && skipping_all_visible_blocks && !FORCE_CHECK_PAGE())
+ continue;
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -743,7 +775,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
empty_pages++;
freespace = PageGetHeapFreeSpace(page);
- /* empty pages are always all-visible */
+ /* empty pages are always all-visible and all-frozen */
if (!PageIsAllVisible(page))
{
START_CRIT_SECTION();
@@ -766,8 +798,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
log_newpage_buffer(buf, true);
PageSetAllVisible(page);
+ PageSetAllFrozen(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
END_CRIT_SECTION();
}
@@ -791,13 +825,15 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
/*
* Note: If you change anything in the loop below, also look at
- * heap_page_is_all_visible to see if that needs to be changed.
+ * heap_page_visible_status to see if that needs to be changed.
*/
for (offnum = FirstOffsetNumber;
offnum <= maxoff;
@@ -945,8 +981,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -993,6 +1034,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute total number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -1015,26 +1059,46 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1045,9 +1109,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
{
- elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+ /* If all-frozen is set then all-visible must be set */
+ if (all_frozen_according_to_vm)
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page is not marked all-visible (and all-frozen) but visibility map bit(s) is set in relation \"%s\" page %u",
relname, blkno);
visibilitymap_clear(onerel, blkno, vmbuffer);
}
@@ -1055,19 +1124,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/*
* It's possible for the value returned by GetOldestXmin() to move
* backwards, so it's not wrong for us to see tuples that appear to
- * not be visible to everyone yet, while PD_ALL_VISIBLE is already
- * set. The real safe xmin value never moves backwards, but
- * GetOldestXmin() is conservative and sometimes returns a value
+ * not be visible to everyone yet, while PD_ALL_VISIBLE (and PD_ALL_FROZEN)
+ * are already set. The real safe xmin value never moves backwards,
+ * but GetOldestXmin() is conservative and sometimes returns a value
* that's unnecessarily small, so if we see that contradiction it just
* means that the tuples that we think are not visible to everyone yet
- * actually are, and the PD_ALL_VISIBLE flag is correct.
+ * actually are, and the PD_ALL_VISIBLE (and PD_ALL_FROZEN) flags are
+ * correct.
*
* There should never be dead tuples on a page with PD_ALL_VISIBLE
* set, however.
*/
else if (PageIsAllVisible(page) && has_dead_tuples)
{
- elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+ /* If all-frozen is set then all-visible must be set */
+ if (PageIsAllFrozen(page))
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page containing dead tuples is marked as all-visible (and all-frozen) in relation \"%s\" page %u",
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
@@ -1141,6 +1216,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to visibility map",
+ "skipped %d frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages,
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1257,6 +1339,8 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_visible;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1308,19 +1392,36 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ heap_page_visible_status(onerel, buffer, &visibility_cutoff_xid,
+ &all_visible, &all_frozen);
+ if (all_visible)
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set the page flag and the VM all-frozen bit.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
+ uint8 vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer);
+ uint8 flags = 0;
+
+ if (!(vm_status & VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* Set the VM all-frozen bit to flag, if needed */
+ if (all_frozen && !(vm_status & VISIBILITYMAP_ALL_FROZEN))
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+
+ if (vm_status != flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1842,18 +1943,21 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and all_frozen, which indicates whether
+ * all tuples of this page are frozen.
*/
-static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+static void
+heap_page_visible_status(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_visible, bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
OffsetNumber offnum,
maxoff;
- bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_visible = true;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1861,7 +1965,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
*/
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = FirstOffsetNumber;
- offnum <= maxoff && all_visible;
+ offnum <= maxoff && *all_visible;
offnum = OffsetNumberNext(offnum))
{
ItemId itemid;
@@ -1877,11 +1981,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
- all_visible = false;
+ *all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1900,7 +2005,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Check comments in lazy_scan_heap. */
if (!HeapTupleHeaderXminCommitted(tuple.t_data))
{
- all_visible = false;
+ *all_visible = false;
break;
}
@@ -1911,13 +2016,17 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
xmin = HeapTupleHeaderGetXmin(tuple.t_data);
if (!TransactionIdPrecedes(xmin, OldestXmin))
{
- all_visible = false;
+ *all_visible = false;
break;
}
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1925,7 +2034,8 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_RECENTLY_DEAD:
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
- all_visible = false;
+ *all_visible = false;
+ *all_frozen = false;
break;
default:
@@ -1934,5 +2044,6 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
} /* scan along page */
- return all_visible;
+ if (!(*all_visible))
+ *all_frozen = false;
}
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 90afbdc..4f6f91c 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -85,9 +85,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
* which all tuples are known visible to everybody. In any case,
* we'll use the index tuple not the heap tuple as the data source.
*
- * Note on Memory Ordering Effects: visibilitymap_test does not lock
- * the visibility map buffer, and therefore the result we read here
- * could be slightly stale. However, it can't be stale enough to
+ * Note on Memory Ordering Effects: visibilitymap_get_status does not
+ * lock the visibility map buffer, and therefore the result we read
+ * here could be slightly stale. However, it can't be stale enough to
* matter.
*
* We need to detect clearing a VM bit due to an insert right away,
@@ -114,9 +114,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
* It's worth going through this complexity to avoid needing to lock
* the VM buffer, which could cause significant contention.
*/
- if (!visibilitymap_test(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ if (!VM_ALL_VISIBLE(scandesc->heapRelation,
+ ItemPointerGetBlockNumber(tid),
+ &node->ioss_VMBuffer))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index da768c6..08b61cb 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1329,7 +1329,8 @@ pgstat_report_autovac(Oid dboid)
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgVacuum msg;
@@ -1343,6 +1344,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -1354,7 +1356,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
*/
void
pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgAnalyze msg;
@@ -1394,6 +1397,7 @@ pgstat_report_analyze(Relation rel,
msg.m_analyzetime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -3702,6 +3706,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
+ result->n_frozen_pages = 0;
result->blocks_fetched = 0;
result->blocks_hit = 0;
result->vacuum_timestamp = 0;
@@ -5069,6 +5074,7 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
if (msg->m_autovacuum)
{
@@ -5103,6 +5109,7 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
/*
* We reset changes_since_analyze to zero, forgetting any changes that
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 1b22fcc..7c57b3e 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -46,6 +46,7 @@ extern Datum pg_stat_get_vacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autovacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_analyze_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_frozen_pages(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_calls(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_total_time(PG_FUNCTION_ARGS);
@@ -450,6 +451,21 @@ pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_frozen_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int32 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int32) (tabentry->n_frozen_pages);
+
+ PG_RETURN_INT32(result);
+}
+
+Datum
pg_stat_get_function_calls(PG_FUNCTION_ARGS)
{
Oid funcid = PG_GETARG_OID(0);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 9357ad8..ce55541 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -9,7 +9,11 @@
#include "postgres_fe.h"
+#include "access/visibilitymap.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
#include <fcntl.h>
@@ -21,6 +25,43 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static const char *rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool force);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
+
+/*
+ * copyOrRewriteFile()
+ * This function either copies a file or rewrites a visibility map file.
+ * If rewrite_vm is true, we rewrite the visibility map regardless of the value of pageConverter.
+ */
+const char *
+copyOrRewriteFile(pageCnvCtx *pageConverter,
+ const char *src, const char *dst, bool force, bool rewrite_vm)
+{
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, force);
+ else
+ return copyAndUpdateFile(pageConverter, src, dst, force);
+}
/*
* copyAndUpdateFile()
@@ -115,12 +156,14 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
*/
const char *
linkAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst)
+ const char *src, const char *dst, bool rewrite_vm)
{
if (pageConverter != NULL)
return "Cannot in-place update this cluster, page-by-page conversion is required";
- if (pg_link_file(src, dst) == -1)
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, true);
+ else if (pg_link_file(src, dst) == -1)
return getErrorText();
else
return NULL;
@@ -205,6 +248,96 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilitymap()
+ *
+ * Copies a visibility map file, adding a cleared all-frozen bit (0) after each existing bit.
+ */
+static const char *
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
+{
+ int src_fd = 0;
+ int dst_fd = 0;
+ char buffer[BLCKSZ];
+ ssize_t bytesRead;
+ int rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+ BlockNumber blkno = 0;
+
+ /* Reset errno */
+ errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return getErrorText();
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ /* Perform data rewriting per page */
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
+ char *cur, *end, *blkend;
+ PageHeaderData pageheader;
+ uint16 vm_bits;
+
+ /* Save the page header data */
+ memcpy(&pageheader, buffer, SizeOfPageHeaderData);
+
+ cur = buffer + SizeOfPageHeaderData;
+ end = buffer + SizeOfPageHeaderData + rewriteVmBytesPerPage;
+ blkend = buffer + bytesRead;
+
+ while (blkend >= end)
+ {
+ char vmbuf[BLCKSZ];
+ char *vmtmp = vmbuf;
+
+ /* Copy page header in advance */
+ memcpy(vmbuf, &pageheader, SizeOfPageHeaderData);
+
+ vmtmp += SizeOfPageHeaderData;
+
+ /* Rewrite visibility map bit one by one */
+ while (end > cur)
+ {
+ /* Look up the rewritten bits in the table and write them out */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+ memcpy(vmtmp, &vm_bits, BITS_PER_HEAPBLOCK);
+
+ cur++;
+ vmtmp += BITS_PER_HEAPBLOCK;
+ }
+
+ /* Set new checksum for the visibility map page, if checksums are enabled */
+ if (old_cluster.controldata.data_checksum_version != 0 &&
+ new_cluster.controldata.data_checksum_version != 0)
+ ((PageHeader) vmbuf)->pd_checksum = pg_checksum_page(vmbuf, blkno);
+
+ if (write(dst_fd, vmbuf, BLCKSZ) != BLCKSZ)
+ {
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ end += rewriteVmBytesPerPage;
+ blkno++;
+ }
+ }
+
+err:
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText();
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index bc733c4..5fb98ae 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -110,6 +110,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map was changed with this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201601171
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -394,10 +398,12 @@ const pageCnvCtx *setupPageConverter(void);
typedef void *pageCnvCtx;
#endif
+const char *copyOrRewriteFile(pageCnvCtx *pageConverter, const char *src,
+ const char *dst, bool force, bool rewrite_vm);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst);
+ const char *dst, bool rewrite_vm);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c059c5b..e02a931 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *suffix, bool vm_need_rewrite);
/*
@@ -171,6 +171,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_need_rewrite = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -180,13 +181,20 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_need_rewrite = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(pageConverter, &maps[mapnum], "", false);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +202,14 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(pageConverter, &maps[mapnum], "_fsm", false);
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ {
+ if (vm_need_rewrite)
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", true);
+ else
+ transfer_relfile(pageConverter, &maps[mapnum], "_vm", false);
+ }
}
}
}
@@ -210,7 +223,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
*/
static void
transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+ const char *type_suffix, bool vm_need_rewrite)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -276,7 +289,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ if ((msg = copyOrRewriteFile(pageConverter, old_file, new_file, true, vm_need_rewrite)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -284,7 +297,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
+ if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file, vm_need_rewrite)) != NULL)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index ba79fb3..cd9b17e 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index f77489b..5fcb539 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index d447daf..a75de5c 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,36 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
+
+/* Number of heap blocks we can represent in one byte. */
+#define HEAPBLOCKS_PER_BYTE 4
+
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+#define VISIBILITYMAP_VALID_BITS 0x03 /* OR of all valid visibilitymap flag bits */
+
+/* Macros for visibilitymap test */
+#define VM_ALL_VISIBLE(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_VISIBLE) != 0)
+#define VM_ALL_FROZEN(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 62e08a9..54b9944 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201601091
+#define CATALOG_VERSION_NO 201601171
#endif
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f58672e..7e05dc1 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2781,6 +2781,8 @@ DATA(insert OID = 3056 ( pg_stat_get_analyze_count PGNSP PGUID 12 1 0 0 0 f f f
DESCR("statistics: number of manual analyzes for a table");
DATA(insert OID = 3057 ( pg_stat_get_autoanalyze_count PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_autoanalyze_count _null_ _null_ _null_ ));
DESCR("statistics: number of auto analyzes for a table");
+DATA(insert OID = 6015 ( pg_stat_get_frozen_pages PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_frozen_pages _null_ _null_ _null_ ));
+DESCR("statistics: number of frozen pages of table");
DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 0 f f f f t t s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_idset _null_ _null_ _null_ ));
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f f f t s r 1 0 2249 "23" "{23,26,23,26,25,25,25,16,1184,1184,1184,1184,869,25,23,28,28,16,25,25,23,16,25}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,pid,usesysid,application_name,state,query,waiting,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,ssl,sslversion,sslcipher,sslbits,sslcompression,sslclientdn}" _null_ _null_ pg_stat_get_activity _null_ _null_ _null_ ));
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 65e968e..ad40b70 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -17,6 +17,7 @@
#include "portability/instr_time.h"
#include "postmaster/pgarch.h"
#include "storage/barrier.h"
+#include "storage/block.h"
#include "utils/hsearch.h"
#include "utils/relcache.h"
@@ -355,6 +356,7 @@ typedef struct PgStat_MsgVacuum
TimestampTz m_vacuumtime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ BlockNumber m_frozen_pages;
} PgStat_MsgVacuum;
@@ -372,6 +374,7 @@ typedef struct PgStat_MsgAnalyze
TimestampTz m_analyzetime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgAnalyze;
@@ -551,7 +554,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9E
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -615,6 +618,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter n_dead_tuples;
PgStat_Counter changes_since_analyze;
+ int32 n_frozen_pages;
+
PgStat_Counter blocks_fetched;
PgStat_Counter blocks_hit;
@@ -917,9 +922,11 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type t
extern void pgstat_report_autovac(Oid dboid);
extern void pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_recovery_conflict(int reason);
extern void pgstat_report_deadlock(void);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 2ce3be7..0b023b3 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -367,7 +369,12 @@ typedef PageHeaderData *PageHeader;
#define PageSetAllVisible(page) \
(((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
#define PageClearAllVisible(page) \
- (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+ (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
+
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
#define PageIsPrunable(page, oldestxmin) \
( \
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 28b061f..c95c788 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1683,6 +1683,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(c.oid) AS n_frozen_pages,
pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid) AS last_autovacuum,
pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
@@ -1795,6 +1796,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_pages,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
@@ -1838,6 +1840,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_pages,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index f5be70f..95ababf 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -127,6 +127,8 @@ SELECT count(*) FROM tenk2 WHERE unique1 = 1;
1
(1 row)
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
SELECT pg_sleep(1.0);
@@ -175,6 +177,14 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
t | t
(1 row)
+SELECT n_frozen_pages = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+ ?column?
+----------
+ t
+(1 row)
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
snapshot_newer
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..87206b6
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,22 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 44 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 56 nonremovable row versions in 1 out of 45 pages
+\set VERBOSITY default
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index b1bc7c7..e31fa76 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -110,3 +110,6 @@ test: event_trigger
# run stats by itself because its delay may be insufficient under heavy load
test: stats
+
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index ade9ef1..666e40c 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -161,3 +161,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index cd2d592..dea5553 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -120,6 +120,8 @@ ROLLBACK;
SELECT count(*) FROM tenk2;
-- do an indexscan
SELECT count(*) FROM tenk2 WHERE unique1 = 1;
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
@@ -145,6 +147,10 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
WHERE st.relname='tenk2' AND cl.relname='tenk2';
+SELECT n_frozen_pages = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..365570b
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,16 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+\set VERBOSITY default
+
+DROP TABLE vmtest;
001_enhance_visibilitymap_debug_messages_v1.patch (binary/octet-stream)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 6186caf..f4d878b 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -175,7 +175,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s, block %d", RelationGetRelationName(rel), heapBlk);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -274,7 +274,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
uint8 *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s, block %d, flags %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -364,7 +364,7 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_get_status %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_get_status %s, block %d", RelationGetRelationName(rel), heapBlk);
#endif
/* Reuse the old pinned buffer if possible */
@@ -467,7 +467,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ elog(DEBUG1, "vm_truncate %s, block %d", RelationGetRelationName(rel), nheapblocks);
#endif
RelationOpenSmgr(rel);
Masahiko Sawada wrote:
Attached updated version patch.
Please review it.
In pg_upgrade, the "page convert" functionality is there to abstract
rewrites of pages being copied; your patch is circumventing it and
AFAICS it makes the interface more complicated for no good reason. I
think the real way to do that is to write your rewriteVisibilityMap as a
pageConverter routine. That should reduce some duplication there.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2/1/16 4:59 PM, Alvaro Herrera wrote:
Masahiko Sawada wrote:
Attached updated version patch.
Please review it.
In pg_upgrade, the "page convert" functionality is there to abstract
rewrites of pages being copied; your patch is circumventing it and
AFAICS it makes the interface more complicated for no good reason. I
think the real way to do that is to write your rewriteVisibilityMap as a
pageConverter routine. That should reduce some duplication there.
IIRC this is about the third problem that's been found with pg_upgrade
in this patch. That concerns me given the potential for disaster if
freeze bits are set incorrectly.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Feb 2, 2016 at 10:15 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 2/1/16 4:59 PM, Alvaro Herrera wrote:
Masahiko Sawada wrote:
Attached updated version patch.
Please review it.
In pg_upgrade, the "page convert" functionality is there to abstract
rewrites of pages being copied; your patch is circumventing it and
AFAICS it makes the interface more complicated for no good reason. I
think the real way to do that is to write your rewriteVisibilityMap as a
pageConverter routine. That should reduce some duplication there.
This means that users always have to set a pageConverter plug-in when upgrading?
I was thinking that this conversion is required for all users who want
to upgrade to 9.6, so we should support it in core, not as a plug-in.
IIRC this is about the third problem that's been found with pg_upgrade in
this patch. That concerns me given the potential for disaster if freeze bits
are set incorrectly.
Yeah, I intend to have a diagnostic tool for the visibility map, to exactly
compare the VM between the old one and the new one after upgrading the
postgres server.
I've implemented such a tool; it is in my github repository[1].
I'm thinking of adding this tool into core (e.g., the pg_upgrade directory,
not a contrib module) as a testing function.
[1]: https://github.com/MasahikoSawada/pg_visibilitymap
Regards,
--
Masahiko Sawada
On Tue, Feb 2, 2016 at 11:42 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Feb 2, 2016 at 10:15 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 2/1/16 4:59 PM, Alvaro Herrera wrote:
Masahiko Sawada wrote:
Attached updated version patch.
Please review it.
In pg_upgrade, the "page convert" functionality is there to abstract
rewrites of pages being copied; your patch is circumventing it and
AFAICS it makes the interface more complicated for no good reason. I
think the real way to do that is to write your rewriteVisibilityMap as a
pageConverter routine. That should reduce some duplication there.
This means that user always have to set pageConverter plug-in when upgrading?
I was thinking that this conversion is required for all user who wants
to upgrade to 9.6, so we should support it in core, not as a plug-in.
I misunderstood. Sorry for the noise.
I agree with adding the conversion method as a pageConverter routine.
This patch doesn't actually change the page layout, but the pageConverter
routine checks only the page layout.
And we would have to provide a plugin named convertLayout_X_to_Y.
I think we have two options.
1. Change the page layout (PG_PAGE_LAYOUT_VERSION) to 5; pg_upgrade detects
it and then converts only the VM files.
2. Change the pg_upgrade plugin mechanism so that it can handle conversion
plugins with other names (e.g., convertLayout_vm_to_vfm).
I think #2 is better. Thought?
Regards,
--
Masahiko Sawada
Masahiko Sawada wrote:
I misunderstood. Sorry for noise.
I agree with adding conversion method as a pageConverter routine.
\o/
This patch doesn't change page layout actually, but pageConverter
routine checks only the page layout.
And we have to plugin named convertLayout_X_to_Y.
I think we have two options.
1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
it and then converts only VM files.
2. Change pg_upgrade plugin mechanism so that it can handle other name
conversion plugins (e.g., convertLayout_vm_to_vfm)
I think #2 is better. Thought?
My vote is for #2 as well. Maybe we just didn't have forks when this
functionality was invented; maybe the author just didn't think hard
enough about what would be the right interface to do it.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Feb 2, 2016 at 7:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Masahiko Sawada wrote:
I misunderstood. Sorry for noise.
I agree with adding conversion method as a pageConverter routine.
\o/
This patch doesn't change page layout actually, but pageConverter
routine checks only the page layout.
And we have to plugin named convertLayout_X_to_Y.
I think we have two options.
1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
it and then converts only VM files.
2. Change pg_upgrade plugin mechanism so that it can handle other name
conversion plugins (e.g., convertLayout_vm_to_vfm)
I think #2 is better. Thought?
My vote is for #2 as well. Maybe we just didn't have forks when this
functionality was invented; maybe the author just didn't think hard
enough about what would be the right interface to do it.
Thanks.
I'm planning to change it as follows.
- The pageCnvCtx struct gets a new function pointer, convertVMFile().
If the layout of other relations such as the FSM or CLOG changes in the
future, we could add convertFSMFile() and convertCLOGFile().
(A rough sketch of this interface follows below.)
- Create a new library, convertLayoutVM_add_frozenbit.c, that has a
convertVMFile() function which converts only the visibility map.
When rewriting of the VM is required, convertLayoutVM_add_frozenbit.so
is dynamically loaded.
convertLayout_X_to_Y converts the other relation files.
That is, converting the VM and converting other relations are done independently.
- The current plugin mechanism puts the conversion library (*.so) into
${bin}/plugins (i.e., a new plugin directory is required), but I'm
thinking of putting it into ${libdir}.
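A rough, compilable sketch of what the extended context struct could look
like is below; the convertVMFile member name comes from this proposal, but
the exact types and signatures are illustrative assumptions, not the actual
pg_upgrade definitions:

typedef const char *(*pluginConvertFile) (const char *src, const char *dst);
typedef const char *(*pluginConvertPage) (char *page);
typedef const char *(*pluginConvertVMFile) (const char *src, const char *dst);

typedef struct pageCnvCtxSketch
{
	pluginConvertFile	convertFile;	/* whole-file converter, main fork */
	pluginConvertPage	convertPage;	/* per-page converter, main fork */
	pluginConvertVMFile	convertVMFile;	/* new: converter for the _vm fork */
} pageCnvCtxSketch;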
Please give me feedbacks.
Regards,
--
Masahiko Sawada
Hello,
At Tue, 2 Feb 2016 20:25:23 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoA5iaKQ6K7gUZyzN2KJnPNMeHc6PPPxj6cJgmssjj=fqw@mail.gmail.com>
On Tue, Feb 2, 2016 at 7:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Masahiko Sawada wrote:
I misunderstood. Sorry for noise.
I agree with adding conversion method as a pageConverter routine.
\o/
This patch doesn't change page layout actually, but pageConverter
routine checks only the page layout.
And we have to plugin named convertLayout_X_to_Y.
I think we have two options.
1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
it and then converts only VM files.
2. Change pg_upgrade plugin mechanism so that it can handle other name
conversion plugins (e.g., convertLayout_vm_to_vfm)
I think #2 is better. Thought?
My vote is for #2 as well. Maybe we just didn't have forks when this
functionality was invented; maybe the author just didn't think hard
enough about what would be the right interface to do it.
Thanks.
I'm planning to change as follows.
- pageCnvCtx struct has new function pointer convertVMFile().
If the layout of other relation such as FSM, CLOG in the future, we
could add convertFSMFile() and convertCLOGFile().
- Create new library convertLayoutVM_add_frozenbit.c that has
convertVMFile() function which converts only visibilitymap.
When rewriting of VM is required, convertLayoutVM_add_frozenbit.so
is dynamically loaded.
convertLayout_X_to_Y converts other relation files.
That is, converting VM and converting other relations are independently done.
- Current plugin mechanism puts conversion library (*.so) into
${bin}/plugins (i.g., new plugin directory is required), but I'm
thinking to puts it into ${libdir}.
Please give me feedbacks.
I agree that the plugin mechanism would be usable and needs to be
redesigned, but..
Since the destination version is fixed, the advantage of the
plugin mechanism for pg_upgrade would be the capability to choose a
plugin to load according to some characteristics of the source
database. What do you think the trigger characteristics for
convertLayoutVM_add_frozenbit.so or similar would be? If it is hard-coded
like what transfer_single_new_db is doing for fsm and vm, I
suppose the module does not need to be a plugin.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
This patch has gotten its fair share of feedback in this fest. I moved
it to the next commitfest. Please do keep working on it, and reviewers
who have additional comments are welcome.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Feb 2, 2016 at 10:05 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
This patch has gotten its fair share of feedback in this fest. I moved
it to the next commitfest. Please do keep working on it and reviewers
that have additional comments are welcome.
Thanks!
On Tue, Feb 2, 2016 at 8:59 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Since the destination version is fixed, the advantage of the
plugin mechanism for pg_upgrade would be capability to choose a
plugin to load according to some characteristics of the source
database. What do you think the trigger characteristics for
convertLayoutVM_add_frozenbit.so or similars? If it is hard-coded
like what transfer_single_new_db is doing for fsm and vm, I
suppose the module is not necessary to be a plugin.
Sorry, I couldn't get it.
You meant that we should use rewriteVisibilityMap as a plain function (not
dynamically loaded)?
The destination version is not fixed; it depends on the new cluster version.
I'm planning that convertLayoutVM_add_frozenbit.so is dynamically
loaded and used only when rewriting of the VM is required.
If the layout of the VM is changed again in the future, we could add
other conversion libraries.
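For reference, the rewrite itself is mechanical: the old VM has one
all-visible bit per heap block, while the new format has two bits per block,
with the all-frozen bit left cleared after upgrade. The following standalone
program is only a sketch (not part of the patch) that derives the 256-entry
lookup table that rewriteVisibilitymap() in the attached pg_upgrade changes
uses to expand each old VM byte into 16 bits:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	int		b;

	for (b = 0; b < 256; b++)
	{
		uint16_t	out = 0;
		int			i;

		/*
		 * Old block i contributes one all-visible bit at position i; in the
		 * new format it becomes an all-visible bit at position 2*i, and the
		 * all-frozen bit at position 2*i+1 stays cleared.
		 */
		for (i = 0; i < 8; i++)
			if (b & (1 << i))
				out |= (uint16_t) (1 << (2 * i));

		printf("%5u%s", out, (b % 16 == 15) ? ",\n" : ", ");
	}
	return 0;
}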
Regards,
--
Masahiko Sawada
On Tue, Feb 2, 2016 at 7:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Masahiko Sawada wrote:
I misunderstood. Sorry for noise.
I agree with adding conversion method as a pageConverter routine.
\o/
This patch doesn't change page layout actually, but pageConverter
routine checks only the page layout.
And we have to plugin named convertLayout_X_to_Y.
I think we have two options.
1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
it and then converts only VM files.
2. Change pg_upgrade plugin mechanism so that it can handle other name
conversion plugins (e.g., convertLayout_vm_to_vfm)
I think #2 is better. Thought?
My vote is for #2 as well. Maybe we just didn't have forks when this
functionality was invented; maybe the author just didn't think hard
enough about what would be the right interface to do it.
I've almost written up a very rough patch (it can pass the regression tests).
Windows support is not done yet, and the Makefile is not correct.
I've divided the main patch into two patches: an add-frozen-bit patch and a
pg_upgrade support patch.
The 000 patch is almost the same as the previous code (includes a small fix).
The 001 patch provides rewriting of the visibility map as a pageConverter routine.
The 002 patch is for enhancing the debug messages in visibilitymap.c.
In order to support pageConvert plugin, I made the following changes.
* Main changes
- Remove PAGE_CONVERSION
- The pg_upgrade plugin sources are located in the 'src/bin/pg_upgrade/plugins' directory.
- Move the directory holding plugins from '$(bin)/plugins' to '$(lib)/plugins'.
- Add new page-converter plugin function for visibility map.
- The current code doesn't allow us to use link mode (-k) in the case
where a page converter is required.
But I changed it so that if a page converter for a fork file is
specified, we actually convert that file even in link mode.
* Interface design
convertFile() and convertPage() are the plugin functions for the main
relation file; they are dynamically loaded by loadConvertPlugin().
I added a new pageConvert plugin function, converVMFile(), for the
visibility map (a fork file).
If the layout of CLOG, FSM, etc. changes in the future, we could add
further pageConvert plugin functions such as convertCLOGFile() or
convertFSMFile(); these would be dynamically loaded by
loadAdditionalConvertPlugin().
This means that conversion of the main file and of the other fork files is
executed independently, and fork-file conversion is executed even if link
mode is specified (see the sketch below).
Each conversion plugin is loaded and used only when it is required.
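As a rough illustration of that dispatch (a sketch only: the ForkConverter
table, the TransferAction enum and the helper names find_fork_converter()
and transfer_fork() are invented for this example; only the plugin-function
idea and the "convert forks even in link mode" rule come from the
description above):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef int (*fork_convert_fn) (const char *src, const char *dst);

typedef struct ForkConverter
{
    const char *fork_suffix;    /* e.g. "_vm", "_fsm" */
    fork_convert_fn convert;    /* resolved by loadAdditionalConvertPlugin() */
} ForkConverter;

typedef enum
{
    TRANSFER_CONVERTED,         /* plugin rewrote the file */
    TRANSFER_LINK,              /* caller should hard-link as usual */
    TRANSFER_COPY               /* caller should copy as usual */
} TransferAction;

/*
 * Return the converter registered for a fork, or NULL if the fork can be
 * copied or hard-linked unchanged.
 */
static fork_convert_fn
find_fork_converter(const char *fork_suffix,
                    const ForkConverter *converters, int nconverters)
{
    int         i;

    for (i = 0; i < nconverters; i++)
    {
        if (strcmp(converters[i].fork_suffix, fork_suffix) == 0)
            return converters[i].convert;
    }
    return NULL;
}

/*
 * Decide how to transfer one fork file.  A registered converter always
 * rewrites the fork, even when the user asked for link mode (-k), because
 * the on-disk layout differs between the clusters; the main relation file
 * is handled separately by convertFile()/convertPage() or linked/copied.
 */
static TransferAction
transfer_fork(const char *src, const char *dst, const char *fork_suffix,
              bool link_mode,
              const ForkConverter *converters, int nconverters)
{
    fork_convert_fn convert = find_fork_converter(fork_suffix,
                                                  converters, nconverters);

    if (convert != NULL)
    {
        (void) convert(src, dst);
        return TRANSFER_CONVERTED;
    }

    /* no conversion required: honour link mode as usual */
    return link_mode ? TRANSFER_LINK : TRANSFER_COPY;
}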
I still agree with this plugin approach, but I feel it is still somewhat
complicated, and I'm concerned that the patch size has grown.
Please give me feedback.
If there are no objections, I'm going to spend time improving it.
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v34.patchtext/x-patch; charset=US-ASCII; name=000_add_frozen_bit_into_visibilitymap_v34.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 001988b..5d08c73 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 392eb70..c43443a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5916,7 +5916,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -5960,7 +5960,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..7cc975d 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and of which pages contain only frozen tuples.
+ This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -438,28 +438,25 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows. To ensure all old
+ row versions have been frozen, a scan of all unfrozen pages is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: whole-table freezing is forced if
+ the table has not had all of its row versions guaranteed frozen within
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transactions.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
- that, data loss could result. To ensure that this does not happen,
- autovacuum is invoked on any table that might contain unfrozen rows with
- XIDs older than the age specified by the configuration parameter <xref
- linkend="guc-autovacuum-freeze-max-age">. (This will happen even if
- autovacuum is disabled.)
+ the time <command>VACUUM</> last scanned unfrozen pages. If it were to go
+ unvacuumed for longer than that, data loss could result. To ensure
+ that this does not happen, autovacuum is invoked on any table that might
+ contain unfrozen rows with XIDs older than the age specified by the
+ configuration parameter <xref linkend="guc-autovacuum-freeze-max-age">.
+ (This will happen even if autovacuum is disabled.)
</para>
<para>
@@ -490,8 +487,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +523,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +551,18 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. Whole-table freezing occurs when the whole relation requires it,
+ for example when <structfield>relfrozenxid</> is more than
+ <varname>vacuum_freeze_table_age</> transactions old or when
+ <command>VACUUM</>'s <literal>FREEZE</> option is used; even then,
+ <command>VACUUM</> can skip pages on which all tuples are marked as frozen.
+ When all pages of the table have eventually been marked as frozen by
+ <command>VACUUM</>, after it finishes <literal>age(relfrozenxid)</> should be
+ a little more than the <varname>vacuum_freeze_min_age</> setting that was used
+ (more by the number of transactions started since the <command>VACUUM</> started).
+ If <structfield>relfrozenxid</> has not been advanced by the time
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,28 +639,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ <command>VACUUM</> scans of all unfrozen pages, regardless of what causes
+ them, enable advancing the value for that table.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Such
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +740,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 85459d0..0bcd52d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1423,6 +1423,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Estimated number of rows modified since this table was last analyzed</entry>
</row>
<row>
+ <entry><structfield>n_frozen_pages</></entry>
+ <entry><type>integer</></entry>
+ <entry>Number of frozen pages</entry>
+ </row>
+ <row>
<entry><structfield>last_vacuum</></entry>
<entry><type>timestamp with time zone</></entry>
<entry>Last time at which this table was manually vacuumed
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..5dc8b04 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -623,18 +623,20 @@ can be used to examine the information stored in free space maps.
<para>
Each heap relation has a Visibility Map
(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
+visible to all active transactions, and of pages that contain only frozen tuples.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_vm</> suffix.
+For example, if the filenode of a relation is 12345, the VM is stored in a file
+called <filename>12345_vm</>, in the same directory as the main relation file.
Note that indexes do not have VMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The visibility map stores two bits per heap page: all-visible and all-frozen.
+A set all-visible bit means that all tuples on the page are known to be visible
+to all transactions. A set all-frozen bit means that all tuples on the page are
+completely marked as frozen. This means that the page does not contain any tuples
+that need to be vacuumed, even when a scan of the whole table is required.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f443742..e75144f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3034,9 +3034,9 @@ heap_delete(Relation relation, ItemPointer tid,
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
{
@@ -7205,7 +7205,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -7215,6 +7215,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7804,7 +7805,12 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7856,7 +7862,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index fc28f3f..6d95c7f 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -15,39 +15,45 @@
* visibilitymap_pin - pin a map page for setting a bit
* visibilitymap_pin_ok - check whether correct map page is already pinned
* visibilitymap_set - set a bit in a previously pinned page
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
* visibilitymap_count - count number of bits set in visibility map
* visibilitymap_truncate - truncate the visibility map
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if whole table scanning vacuum is required (e.g. anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible.
*
- * Clearing a visibility map bit is not separately WAL-logged. The callers
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
+ *
+ * Clearing both visibility map bits is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
* replay of the updating operation as well.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is all-visible or
+ * all-frozen; we just don't know that for certain. The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility map
+ * bit. If a crash occurs after the visibility map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
+ * The visibility map has the all-frozen bit which indicates all tuples on
+ * corresponding page have been completely frozen, so the visibility map is also
+ * used for anti-wraparound vacuum, even if freezing of tuples is required.
*
* LOCKING
*
@@ -58,14 +64,14 @@
* section that logs the page modification. However, we don't want to hold
* the buffer lock over any I/O that may be required to read in the visibility
* map page. To avoid this, we examine the heap page before locking it;
- * if the page-level PD_ALL_VISIBLE bit is set, we pin the visibility map
- * bit. Then, we lock the buffer. But this creates a race condition: there
- * is a possibility that in the time it takes to lock the buffer, the
- * PD_ALL_VISIBLE bit gets set. If that happens, we have to unlock the
- * buffer, pin the visibility map page, and relock the buffer. This shouldn't
- * happen often, because only VACUUM currently sets visibility map bits,
- * and the race will only occur if VACUUM processes a given page at almost
- * exactly the same time that someone tries to further modify it.
+ * if the page-level PD_ALL_VISIBLE or PD_ALL_FROZEN bit is set, we pin the
+ * visibility map bit. Then, we lock the buffer. But this creates a race
+ * condition: there is a possibility that in the time it takes to lock the
+ * buffer, the PD_ALL_VISIBLE or PD_ALL_FROZEN bit gets set. If that happens,
+ * we have to unlock the buffer, pin the visibility map page, and relock the
+ * buffer. This shouldn't happen often, because only VACUUM currently sets
+ * visibility map bits, and the race will only occur if VACUUM processes a given
+ * page at almost exactly the same time that someone tries to further modify it.
*
* To set a bit, you need to hold a lock on the heap page. That prevents
* the race condition where VACUUM sees that all tuples on the page are
@@ -101,38 +107,50 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
-
-/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
-
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
/* Mapping from heap block number to the right bit in the visibility map */
#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
-#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
+
+/* tables for fast counting of set bits for visible and frozen */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +159,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,7 +171,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = VISIBILITYMAP_VALID_BITS << mapBit;
char *map;
#ifdef TRACE_VISIBILITYMAP
@@ -186,7 +204,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
* visibilitymap_set to actually set the bit.
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk.
*
@@ -212,7 +230,7 @@ visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
* visibilitymap_pin_ok - do we already have the correct page pinned?
*
* On entry, buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. The return value indicates whether the buffer covers the
* given heapBlk.
*/
@@ -225,7 +243,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +252,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and flags which indicate which bits we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,13 +264,14 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
Page page;
- char *map;
+ uint8 *map;
#ifdef TRACE_VISIBILITYMAP
elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -259,6 +279,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
Assert(InRecovery || BufferIsValid(heapBuf));
+ Assert(flags & VISIBILITYMAP_VALID_BITS);
/* Check that we have the right heap page pinned, if present */
if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -269,14 +290,14 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
elog(ERROR, "wrong VM buffer passed to visibilitymap_set");
page = BufferGetPage(vmBuf);
- map = PageGetContents(page);
+ map = (uint8 *)PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] >> mapBit & VISIBILITYMAP_VALID_BITS))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << mapBit);
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +306,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +316,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(((flags & VISIBILITYMAP_ALL_VISIBLE) && PageIsAllVisible(heapPage)) ||
+ ((flags & VISIBILITYMAP_ALL_FROZEN) && PageIsAllFrozen(heapPage)));
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +336,17 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible to all, or all marked frozen, according
+ * to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
- * earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must pass flags
+ * for which it needs to check the value in visibility map.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -327,17 +355,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* we might see the old value. It is the caller's responsibility to deal with
* all concurrency issues!
*/
-bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+uint8
+visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- bool result;
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_get_status %s %d", RelationGetRelationName(rel), heapBlk);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,13 +387,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * Reading the two bits is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
-
- return result;
+ return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
}
/*
@@ -374,14 +399,20 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
*/
-BlockNumber
-visibilitymap_count(Relation rel)
+void
+visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)
{
- BlockNumber result = 0;
BlockNumber mapBlock;
+ /* all_visible must be specified */
+ Assert(all_visible);
+
+ *all_visible = 0;
+ if (all_frozen)
+ *all_frozen = 0;
+
for (mapBlock = 0;; mapBlock++)
{
Buffer mapBuffer;
@@ -406,13 +437,13 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ *all_visible += number_of_ones_for_visible[map[i]];
+ if (all_frozen)
+ *all_frozen += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
}
-
- return result;
}
/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 313ee9c..ded6d77 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1919,7 +1919,7 @@ index_update_stats(Relation rel,
BlockNumber relallvisible;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ visibilitymap_count(rel, &relallvisible, NULL);
else /* don't bother for indexes */
relallvisible = 0;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 923fe58..86437c6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -452,6 +452,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(C.oid) AS n_frozen_pages,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 070df29..d7f3035 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -314,6 +314,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ BlockNumber relallvisible,
+ relallfrozen;
if (inh)
ereport(elevel,
@@ -564,51 +566,56 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
}
}
- /*
- * Update pages/tuples stats in pg_class ... but not if we're doing
- * inherited stats.
- */
if (!inh)
+ {
+ /* Count the number of all-visible and all-frozen bits */
+ visibilitymap_count(onerel, &relallvisible, &relallfrozen);
+
+ /*
+ * Update pages/tuples stats in pg_class ... but not if we're doing
+ * inherited stats.
+ */
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ relallvisible,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
in_outer_xact);
- /*
- * Same for indexes. Vacuum always scans all indexes, so if we're part of
- * VACUUM ANALYZE, don't overwrite the accurate count already inserted by
- * VACUUM.
- */
- if (!inh && !(options & VACOPT_VACUUM))
- {
- for (ind = 0; ind < nindexes; ind++)
+ /*
+ * Same for indexes. Vacuum always scans all indexes, so if we're part of
+ * VACUUM ANALYZE, don't overwrite the accurate count already inserted by
+ * VACUUM.
+ */
+ if (!(options & VACOPT_VACUUM))
{
- AnlIndexData *thisdata = &indexdata[ind];
- double totalindexrows;
-
- totalindexrows = ceil(thisdata->tupleFract * totalrows);
- vac_update_relstats(Irel[ind],
- RelationGetNumberOfBlocks(Irel[ind]),
- totalindexrows,
- 0,
- false,
- InvalidTransactionId,
- InvalidMultiXactId,
- in_outer_xact);
+ for (ind = 0; ind < nindexes; ind++)
+ {
+ AnlIndexData *thisdata = &indexdata[ind];
+ double totalindexrows;
+
+ totalindexrows = ceil(thisdata->tupleFract * totalrows);
+ vac_update_relstats(Irel[ind],
+ RelationGetNumberOfBlocks(Irel[ind]),
+ totalindexrows,
+ 0,
+ false,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ in_outer_xact);
+ }
}
- }
- /*
- * Report ANALYZE to the stats collector, too. However, if doing
- * inherited stats we shouldn't report, because the stats collector only
- * tracks per-table stats.
- */
- if (!inh)
- pgstat_report_analyze(onerel, totalrows, totaldeadrows);
+ /*
+ * Report ANALYZE to the stats collector, too. However, if doing
+ * inherited stats we shouldn't report, because the stats collector only
+ * tracks per-table stats.
+ */
+ pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
+
+ }
/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
if (!(options & VACOPT_VACUUM))
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 4f6f6e7..fbdb18c 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped by all-frozen bit
+ of visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,8 +158,9 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
ItemPointer itemptr);
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
-static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+static void heap_page_visible_status(Relation rel, Buffer buf,
+ TransactionId *visibility_cutoff_xid,
+ bool *all_visible, bool *all_frozen);
/*
@@ -188,7 +191,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -221,7 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* We request a full scan if either the table's frozen Xid is now older
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit.
+ * mxid full-table scan limit. During full scan, we could skip some pages
+ * according to all-frozen bit of visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +258,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -274,15 +280,15 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* Update statistics in pg_class.
*
* A corner case here is that if we scanned no pages at all because every
- * page is all-visible, we should not update relpages/reltuples, because
- * we have no new information to contribute. In particular this keeps us
- * from replacing relpages=reltuples=0 (which means "unknown tuple
+ * page is all-visible or all-frozen, we should not update relpages/reltuples,
+ * because we have no new information to contribute. In particular this keeps
+ * us from replacing relpages=reltuples=0 (which means "unknown tuple
* density") with nonzero relpages and reltuples=0 (which means "zero
* tuple density") unless there's some actual evidence for the latter.
*
- * We do update relallvisible even in the corner case, since if the table
- * is all-visible we'd definitely like to know that. But clamp the value
- * to be not more than what we're setting relpages to.
+ * We do update relallvisible and relallfrozen even in the corner case,
+ * since if the table is all-visible we'd definitely like to know that.
+ * But clamp the value to be not more than what we're setting relpages to.
*
* Also, don't change relfrozenxid/relminmxid if we skipped any pages,
* since then we don't know for certain that all tuples have a newer xmin.
@@ -295,10 +301,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ visibilitymap_count(onerel, &new_rel_allvisible, &new_rel_allfrozen);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -319,7 +328,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),
onerel->rd_rel->relisshared,
new_live_tuples,
- vacrelstats->new_dead_tuples);
+ vacrelstats->new_dead_tuples,
+ new_rel_allfrozen);
/* and log the action if appropriate */
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
@@ -354,10 +364,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -480,9 +491,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * the visibility map means that we might not be able to update relfrozenxid,
+ * so we only want to do it if we can skip a goodly number. On the other hand,
+ * we count both how many pages we skipped according to the all-frozen bit of
+ * the visibility map and how many pages we froze, so we can update relfrozenxid
+ * if the sum of the two equals the number of pages in the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -492,18 +506,18 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we might not actually skip any pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
- * Note: The value returned by visibilitymap_test could be slightly
+ * Note: The value returned by visibilitymap_get_status could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*
* We will scan the table's last page, at least to the extent of
* determining whether it has tuples or not, even if it should be skipped
@@ -518,7 +532,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -536,9 +550,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples frozen by this vacuum */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on the page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -554,8 +572,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -569,14 +586,29 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
skipping_all_visible_blocks = true;
else
skipping_all_visible_blocks = false;
+
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all && !FORCE_CHECK_PAGE())
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We check whether it is also all-frozen, so that we can skip
+ * vacuuming this page even if scan_all is true.
+ */
+ bool all_frozen = VM_ALL_FROZEN(onerel, blkno, &vmbuffer);
+
+ if (scan_all && all_frozen && !FORCE_CHECK_PAGE())
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ else if (!scan_all && skipping_all_visible_blocks && !FORCE_CHECK_PAGE())
+ continue;
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -743,7 +775,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
empty_pages++;
freespace = PageGetHeapFreeSpace(page);
- /* empty pages are always all-visible */
+ /* empty pages are always all-visible and all-frozen */
if (!PageIsAllVisible(page))
{
START_CRIT_SECTION();
@@ -766,8 +798,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
log_newpage_buffer(buf, true);
PageSetAllVisible(page);
+ PageSetAllFrozen(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
END_CRIT_SECTION();
}
@@ -791,13 +825,15 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
/*
* Note: If you change anything in the loop below, also look at
- * heap_page_is_all_visible to see if that needs to be changed.
+ * heap_page_visible_status to see if that needs to be changed.
*/
for (offnum = FirstOffsetNumber;
offnum <= maxoff;
@@ -945,8 +981,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -993,6 +1034,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute total number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -1015,26 +1059,46 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1045,9 +1109,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
{
- elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+ /* If the all-frozen is set then all-visible must be set */
+ if (all_frozen_according_to_vm)
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page is not marked all-visible (and all-frozen) but visibility map bit(s) is set in relation \"%s\" page %u",
relname, blkno);
visibilitymap_clear(onerel, blkno, vmbuffer);
}
@@ -1055,19 +1124,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/*
* It's possible for the value returned by GetOldestXmin() to move
* backwards, so it's not wrong for us to see tuples that appear to
- * not be visible to everyone yet, while PD_ALL_VISIBLE is already
- * set. The real safe xmin value never moves backwards, but
- * GetOldestXmin() is conservative and sometimes returns a value
+ * not be visible to everyone yet, while PD_ALL_VISIBLE (and PD_ALL_FROZEN)
+ * are already set. The real safe xmin value never moves backwards,
+ * but GetOldestXmin() is conservative and sometimes returns a value
* that's unnecessarily small, so if we see that contradiction it just
* means that the tuples that we think are not visible to everyone yet
- * actually are, and the PD_ALL_VISIBLE flag is correct.
+ * actually are, and the PD_ALL_VISIBLE (and PD_ALL_FROZEN) flags are
+ * correct.
*
* There should never be dead tuples on a page with PD_ALL_VISIBLE
* set, however.
*/
else if (PageIsAllVisible(page) && has_dead_tuples)
{
- elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+ /* If all-frozen is set then all-visible must be set */
+ if (PageIsAllFrozen(page))
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page containing dead tuples is marked as all-visible (and all-frozen) in relation \"%s\" page %u",
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
@@ -1141,6 +1216,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to visibility map",
+ "skipped %d frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages,
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1257,6 +1339,8 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_visible;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1308,19 +1392,36 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ heap_page_visible_status(onerel, buffer, &visibility_cutoff_xid,
+ &all_visible, &all_frozen);
+ if (all_visible)
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set the VM all-frozen bit and flag.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
+ uint8 vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer);
+ uint8 flags = 0;
+
+ if (!(vm_status & VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* Set the VM all-frozen bit to flag, if needed */
+ if (all_frozen && !(vm_status & VISIBILITYMAP_ALL_FROZEN))
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+
+ if (vm_status != flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1842,18 +1943,21 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and all_frozen which implies that all tuples
+ * of this page are frozen.
*/
-static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+static void
+heap_page_visible_status(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_visible, bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
OffsetNumber offnum,
maxoff;
- bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_visible = true;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1861,7 +1965,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
*/
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = FirstOffsetNumber;
- offnum <= maxoff && all_visible;
+ offnum <= maxoff && *all_visible;
offnum = OffsetNumberNext(offnum))
{
ItemId itemid;
@@ -1877,11 +1981,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
- all_visible = false;
+ *all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1900,7 +2005,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Check comments in lazy_scan_heap. */
if (!HeapTupleHeaderXminCommitted(tuple.t_data))
{
- all_visible = false;
+ *all_visible = false;
break;
}
@@ -1911,13 +2016,17 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
xmin = HeapTupleHeaderGetXmin(tuple.t_data);
if (!TransactionIdPrecedes(xmin, OldestXmin))
{
- all_visible = false;
+ *all_visible = false;
break;
}
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1925,7 +2034,8 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_RECENTLY_DEAD:
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
- all_visible = false;
+ *all_visible = false;
+ *all_frozen = false;
break;
default:
@@ -1934,5 +2044,6 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
} /* scan along page */
- return all_visible;
+ if (!(*all_visible))
+ *all_frozen = false;
}
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 90afbdc..4f6f91c 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -85,9 +85,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
* which all tuples are known visible to everybody. In any case,
* we'll use the index tuple not the heap tuple as the data source.
*
- * Note on Memory Ordering Effects: visibilitymap_test does not lock
- * the visibility map buffer, and therefore the result we read here
- * could be slightly stale. However, it can't be stale enough to
+ * Note on Memory Ordering Effects: visibilitymap_get_status does not
+ * lock the visibility map buffer, and therefore the result we read
+ * here could be slightly stale. However, it can't be stale enough to
* matter.
*
* We need to detect clearing a VM bit due to an insert right away,
@@ -114,9 +114,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
* It's worth going through this complexity to avoid needing to lock
* the VM buffer, which could cause significant contention.
*/
- if (!visibilitymap_test(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ if (!VM_ALL_VISIBLE(scandesc->heapRelation,
+ ItemPointerGetBlockNumber(tid),
+ &node->ioss_VMBuffer))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index da768c6..08b61cb 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1329,7 +1329,8 @@ pgstat_report_autovac(Oid dboid)
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgVacuum msg;
@@ -1343,6 +1344,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -1354,7 +1356,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
*/
void
pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgAnalyze msg;
@@ -1394,6 +1397,7 @@ pgstat_report_analyze(Relation rel,
msg.m_analyzetime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -3702,6 +3706,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
+ result->n_frozen_pages = 0;
result->blocks_fetched = 0;
result->blocks_hit = 0;
result->vacuum_timestamp = 0;
@@ -5069,6 +5074,7 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
if (msg->m_autovacuum)
{
@@ -5103,6 +5109,7 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
/*
* We reset changes_since_analyze to zero, forgetting any changes that
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 1b22fcc..7c57b3e 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -46,6 +46,7 @@ extern Datum pg_stat_get_vacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autovacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_analyze_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_frozen_pages(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_calls(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_total_time(PG_FUNCTION_ARGS);
@@ -450,6 +451,21 @@ pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_frozen_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int32 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int32) (tabentry->n_frozen_pages);
+
+ PG_RETURN_INT32(result);
+}
+
+Datum
pg_stat_get_function_calls(PG_FUNCTION_ARGS)
{
Oid funcid = PG_GETARG_OID(0);
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index d447daf..a75de5c 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,36 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, other for all-frozen.
+*/
+#define BITS_PER_HEAPBLOCK 2
+
+/* Number of heap blocks we can represent in one byte. */
+#define HEAPBLOCKS_PER_BYTE 4
+
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+#define VISIBILITYMAP_VALID_BITS 0x03 /* OR of all valid visibilitymap flags bits */
+
+/* Macros for visibilitymap test */
+#define VM_ALL_VISIBLE(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_VISIBLE) != 0)
+#define VM_ALL_FROZEN(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 5c480b7..68ec2e1 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201601281
+#define CATALOG_VERSION_NO 201602021
#endif
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index a2248b4..9842294 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2698,6 +2698,8 @@ DATA(insert OID = 3056 ( pg_stat_get_analyze_count PGNSP PGUID 12 1 0 0 0 f f f
DESCR("statistics: number of manual analyzes for a table");
DATA(insert OID = 3057 ( pg_stat_get_autoanalyze_count PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_autoanalyze_count _null_ _null_ _null_ ));
DESCR("statistics: number of auto analyzes for a table");
+DATA(insert OID = 6015 ( pg_stat_get_frozen_pages PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_frozen_pages _null_ _null_ _null_ ));
+DESCR("statistics: number of frozen pages of table");
DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 0 f f f f t t s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_idset _null_ _null_ _null_ ));
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f f f t s r 1 0 2249 "23" "{23,26,23,26,25,25,25,16,1184,1184,1184,1184,869,25,23,28,28,16,25,25,23,16,25}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,pid,usesysid,application_name,state,query,waiting,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,ssl,sslversion,sslcipher,sslbits,sslcompression,sslclientdn}" _null_ _null_ pg_stat_get_activity _null_ _null_ _null_ ));
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 65e968e..ad40b70 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -17,6 +17,7 @@
#include "portability/instr_time.h"
#include "postmaster/pgarch.h"
#include "storage/barrier.h"
+#include "storage/block.h"
#include "utils/hsearch.h"
#include "utils/relcache.h"
@@ -355,6 +356,7 @@ typedef struct PgStat_MsgVacuum
TimestampTz m_vacuumtime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ BlockNumber m_frozen_pages;
} PgStat_MsgVacuum;
@@ -372,6 +374,7 @@ typedef struct PgStat_MsgAnalyze
TimestampTz m_analyzetime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgAnalyze;
@@ -551,7 +554,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9E
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -615,6 +618,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter n_dead_tuples;
PgStat_Counter changes_since_analyze;
+ int32 n_frozen_pages;
+
PgStat_Counter blocks_fetched;
PgStat_Counter blocks_hit;
@@ -917,9 +922,11 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type t
extern void pgstat_report_autovac(Oid dboid);
extern void pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_recovery_conflict(int reason);
extern void pgstat_report_deadlock(void);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 2ce3be7..0b023b3 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -367,7 +369,12 @@ typedef PageHeaderData *PageHeader;
#define PageSetAllVisible(page) \
(((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
#define PageClearAllVisible(page) \
- (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+ (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
+
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
#define PageIsPrunable(page, oldestxmin) \
( \
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 28b061f..c95c788 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1683,6 +1683,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(c.oid) AS n_frozen_pages,
pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid) AS last_autovacuum,
pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
@@ -1795,6 +1796,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_pages,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
@@ -1838,6 +1840,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_pages,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index f5be70f..95ababf 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -127,6 +127,8 @@ SELECT count(*) FROM tenk2 WHERE unique1 = 1;
1
(1 row)
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
SELECT pg_sleep(1.0);
@@ -175,6 +177,14 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
t | t
(1 row)
+SELECT n_frozen_pages = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+ ?column?
+----------
+ t
+(1 row)
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
snapshot_newer
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..87206b6
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,22 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+-- All pages should become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 44 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 56 nonremovable row versions in 1 out of 45 pages
+\set VERBOSITY default
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index b1bc7c7..e31fa76 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -110,3 +110,6 @@ test: event_trigger
# run stats by itself because its delay may be insufficient under heavy load
test: stats
+
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index ade9ef1..666e40c 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -161,3 +161,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index cd2d592..dea5553 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -120,6 +120,8 @@ ROLLBACK;
SELECT count(*) FROM tenk2;
-- do an indexscan
SELECT count(*) FROM tenk2 WHERE unique1 = 1;
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
@@ -145,6 +147,10 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
WHERE st.relname='tenk2' AND cl.relname='tenk2';
+SELECT n_frozen_pages = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..365570b
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,16 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+-- All pages should become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+\set VERBOSITY default
+
+DROP TABLE vmtest;
001_freezemap_support_for_pg_upgrade_v34.patchtext/x-patch; charset=US-ASCII; name=001_freezemap_support_for_pg_upgrade_v34.patchDownload
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index d9c8145..153622d 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -11,8 +11,11 @@ OBJS = check.o controldata.o dump.o exec.o file.o function.o info.o \
option.o page.o parallel.o pg_upgrade.o relfilenode.o server.o \
tablespace.o util.o version.o $(WIN32RES)
+SUBDIRS = plugins
+
override CPPFLAGS := -DDLSUFFIX=\"$(DLSUFFIX)\" -I$(srcdir) -I$(libpq_srcdir) $(CPPFLAGS)
+$(recurse)
all: pg_upgrade
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 9357ad8..4c4b955 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -14,6 +14,7 @@
#include <fcntl.h>
+const static char *pg_copy_file(const char *src, const char *dst, bool force);
#ifndef WIN32
static int copy_file(const char *fromfile, const char *tofile, bool force);
@@ -22,6 +23,8 @@ static int win32_pghardlink(const char *src, const char *dst);
#endif
+const char *convertVMFile(pageCnvCtx *pageConverter, const char *src, const char *dst);
+
/*
* copyAndUpdateFile()
*
@@ -30,19 +33,11 @@ static int win32_pghardlink(const char *src, const char *dst);
*/
const char *
copyAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst, bool force)
+ const char *src, const char *dst, const char *type_suffix,
+ bool force)
{
if (pageConverter == NULL)
- {
-#ifndef WIN32
- if (copy_file(src, dst, force) == -1)
-#else
- if (CopyFile(src, dst, !force) == 0)
-#endif
- return getErrorText();
- else
- return NULL;
- }
+ return pg_copy_file(src, dst, force);
else
{
/*
@@ -55,12 +50,18 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
* file and call the convertPage plugin function.
*/
-#ifdef PAGE_CONVERSION
- if (pageConverter->convertFile)
- return pageConverter->convertFile(pageConverter->pluginData,
- dst, src);
+ /* Process visibility map */
+ if (strcmp(type_suffix, "_vm") == 0)
+ {
+ if (pageConverter->convertVMFile == NULL)
+ return pg_copy_file(src, dst, force);
+ else
+ return convertVMFile(pageConverter, src, dst);
+ }
+ /* Process relation file */
+ else if (type_suffix == NULL && pageConverter->convertFile)
+ return pageConverter->convertFile(pageConverter->pluginData, dst, src);
else
-#endif
{
int src_fd;
int dstfd;
@@ -79,10 +80,9 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
while ((bytesRead = read(src_fd, buf, BLCKSZ)) == BLCKSZ)
{
-#ifdef PAGE_CONVERSION
- if ((msg = pageConverter->convertPage(pageConverter->pluginData, buf, buf)) != NULL)
- break;
-#endif
+ if (pageConverter->convertPage)
+ if ((msg = pageConverter->convertPage(pageConverter->pluginData, buf, buf)) != NULL)
+ break;
if (write(dstfd, buf, BLCKSZ) != BLCKSZ)
{
msg = "could not write new page to destination";
@@ -103,7 +103,6 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
}
}
-
/*
* linkAndUpdateFile()
*
@@ -115,15 +114,29 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
*/
const char *
linkAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst)
+ const char *src, const char *dst, const char *type_suffix,
+ bool rewrite_vm)
{
- if (pageConverter != NULL)
+ if (convertRelfile(pageConverter))
return "Cannot in-place update this cluster, page-by-page conversion is required";
+ /* Convert page actually using additional pageConverter */
+ if (strcmp(type_suffix, "_vm") == 0)
+ return convertVMFile(pageConverter, src, dst);
+
if (pg_link_file(src, dst) == -1)
return getErrorText();
else
return NULL;
+
+/*
+ if (rewrite_vm)
+ return rewriteVisibilitymap(src, dst, true);
+ else if (pg_link_file(src, dst) == -1)
+ return getErrorText();
+ else
+ return NULL;
+*/
}
@@ -204,6 +217,28 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
}
#endif
+/*
+ * convertVMFile()
+ *
+ * This function fills in the checksum information needed for rewriting the VM,
+ * and executes the plugin function.
+ */
+const char *
+convertVMFile(pageCnvCtx *pageConverter, const char *src, const char *dst)
+
+{
+ bool checksum_enabled = false;
+
+ /* Check whether checksum is enabled on both cluster */
+ if (old_cluster.controldata.data_checksum_version != 0 &&
+ new_cluster.controldata.data_checksum_version != 0)
+ checksum_enabled = true;
+
+ /* Convert visibility map file */
+ pageConverter->pluginData = (void *) &checksum_enabled;
+
+ return pageConverter->convertVMFile(pageConverter->pluginData, dst, src);
+}
void
check_hard_link(void)
@@ -224,6 +259,20 @@ check_hard_link(void)
unlink(new_link_file);
}
+const static char *
+pg_copy_file(const char *src, const char *dst, bool force)
+{
+
+#ifndef WIN32
+ if (copy_file(src, dst, force) == -1)
+#else
+ if (CopyFile(src, dst, !force) == 0)
+#endif
+ return getErrorText();
+ else
+ return NULL;
+}
+
#ifdef WIN32
static int
win32_pghardlink(const char *src, const char *dst)
diff --git a/src/bin/pg_upgrade/page.c b/src/bin/pg_upgrade/page.c
index e5686e5..423deae 100644
--- a/src/bin/pg_upgrade/page.c
+++ b/src/bin/pg_upgrade/page.c
@@ -13,15 +13,31 @@
#include "storage/bufpage.h"
-
-#ifdef PAGE_CONVERSION
+#include <dlfcn.h>
static void getPageVersion(
uint16 *version, const char *pathName);
static pageCnvCtx *loadConverterPlugin(
uint16 newPageVersion, uint16 oldPageVersion);
+static pageCnvCtx *loadAdditionalConverterPlugin(pageCnvCtx *converter,
+ const char *pluginName);
+static void initializePageConverter(pageCnvCtx *converter);
+/*
+ * initializePageConverter()
+ *
+ * Initialize pageConverter struct.
+ */
+static void
+initializePageConverter(pageCnvCtx *converter)
+{
+ converter->startup = NULL;
+ converter->convertFile = NULL;
+ converter->convertVMFile = NULL;
+ converter->convertPage = NULL;
+ converter->shutdown = NULL;
+}
/*
* setupPageConverter()
@@ -34,16 +50,16 @@ static pageCnvCtx *loadConverterPlugin(
* returns a NULL pageCnvCtx pointer to indicate that page-by-page conversion
* is not required.
*/
-pageCnvCtx *
+const pageCnvCtx *
setupPageConverter(void)
{
uint16 oldPageVersion;
uint16 newPageVersion;
- pageCnvCtx *converter;
- const char *msg;
+ pageCnvCtx *converter = NULL;
char dstName[MAXPGPATH];
char srcName[MAXPGPATH];
+
snprintf(dstName, sizeof(dstName), "%s/global/%u", new_cluster.pgdata,
new_cluster.pg_database_oid);
snprintf(srcName, sizeof(srcName), "%s/global/%u", old_cluster.pgdata,
@@ -63,16 +79,33 @@ setupPageConverter(void)
* plugin that knows how to convert from the old page layout to the
* new page layout.
*/
-
if ((converter = loadConverterPlugin(newPageVersion, oldPageVersion)) == NULL)
pg_fatal("could not find plugin to convert from old page layout to new page layout\n");
+ }
- return converter;
+
+ /*
+ * Do we need to rewrite the visibility map? If yes, load the specific converter library.
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ {
+ char libpath[MAXPGPATH];
+ char pluginName[MAXPGPATH];
+
+ get_lib_path(mypath, libpath);
+ snprintf(pluginName, sizeof(pluginName), "%s/plugins/convertLayoutVM_add_frozenbit%s",
+ libpath, DLSUFFIX);
+
+ if ((converter = loadAdditionalConverterPlugin(converter, pluginName)) == NULL)
+ pg_fatal("could not find additional plugin to convert from old page layout to new page layout\n");
}
- else
- return NULL;
-}
+ if (converter)
+ return converter;
+
+ return NULL;
+}
/*
* getPageVersion()
@@ -118,8 +151,8 @@ getPageVersion(uint16 *version, const char *pathName)
static pageCnvCtx *
loadConverterPlugin(uint16 newPageVersion, uint16 oldPageVersion)
{
- char pluginName[MAXPGPATH];
void *plugin;
+ char pluginName[MAXPGPATH];
/*
* Try to find a plugin that can convert pages of oldPageVersion into
@@ -135,19 +168,19 @@ loadConverterPlugin(uint16 newPageVersion, uint16 oldPageVersion)
snprintf(pluginName, sizeof(pluginName), "./plugins/convertLayout_%d_to_%d%s",
oldPageVersion, newPageVersion, DLSUFFIX);
- if ((plugin = pg_dlopen(pluginName)) == NULL)
+ if ((plugin = dlopen(pluginName, RTLD_NOW | RTLD_GLOBAL)) == NULL)
return NULL;
else
{
pageCnvCtx *result = (pageCnvCtx *) pg_malloc(sizeof(*result));
- result->old.PageVersion = oldPageVersion;
- result->new.PageVersion = newPageVersion;
+ result->oldPageVersion = oldPageVersion;
+ result->newPageVersion = newPageVersion;
- result->startup = (pluginStartup) pg_dlsym(plugin, "init");
- result->convertFile = (pluginConvertFile) pg_dlsym(plugin, "convertFile");
- result->convertPage = (pluginConvertPage) pg_dlsym(plugin, "convertPage");
- result->shutdown = (pluginShutdown) pg_dlsym(plugin, "fini");
+ result->startup = (pluginStartup) dlsym(plugin, "init");
+ result->convertFile = (pluginConvertFile) dlsym(plugin, "convertFile");
+ result->convertPage = (pluginConvertPage) dlsym(plugin, "convertPage");
+ result->shutdown = (pluginShutdown) dlsym(plugin, "fini");
result->pluginData = NULL;
/*
@@ -161,4 +194,29 @@ loadConverterPlugin(uint16 newPageVersion, uint16 oldPageVersion)
}
}
-#endif
+/*
+ * loadAdditionalConverterPlugin()
+ *
+ * This function loads an additional page-converter plugin library for forks
+ * and grabs a pointer to each of the (interesting) functions provided by that
+ * plugin.  If converter is NULL, it means we didn't load the main page converter
+ * and need to allocate the page-converter struct.
+ */
+static pageCnvCtx *
+loadAdditionalConverterPlugin(pageCnvCtx *converter, const char *pluginName)
+{
+ void *plugin;
+
+ if (!converter)
+ {
+ converter = (pageCnvCtx *) pg_malloc(sizeof(pageCnvCtx));
+ initializePageConverter(converter);
+ }
+
+ if ((plugin = dlopen(pluginName, RTLD_NOW | RTLD_GLOBAL)) == NULL)
+ return NULL;
+ else
+ converter->convertVMFile = (pluginConvertFile) dlsym(plugin, "convertVMFile");
+
+ return converter;
+}
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 984c395..71c69db 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -54,6 +54,7 @@ static void cleanup(void);
ClusterInfo old_cluster,
new_cluster;
OSInfo os_info;
+char mypath[MAXPGPATH];
char *output_files[] = {
SERVER_LOG_FILE,
@@ -76,6 +77,9 @@ main(int argc, char **argv)
parseCommandLine(argc, argv);
+ if (find_my_exec(argv[0], mypath) != 0)
+ pg_fatal("could not find own program executable\n");
+
get_restricted_token(os_info.progname);
adjust_data_dir(&old_cluster);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index bc733c4..4a500c5 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -110,6 +110,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map changed with this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201602021
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -322,6 +326,7 @@ extern UserOpts user_opts;
extern ClusterInfo old_cluster,
new_cluster;
extern OSInfo os_info;
+extern char mypath[MAXPGPATH];
/* check.c */
@@ -364,7 +369,6 @@ bool pid_lock_file_exists(const char *datadir);
/* file.c */
-#ifdef PAGE_CONVERSION
typedef const char *(*pluginStartup) (uint16 migratorVersion,
uint16 *pluginVersion, uint16 newPageVersion,
uint16 oldPageVersion, void **pluginData);
@@ -383,21 +387,24 @@ typedef struct
pluginStartup startup; /* Pointer to plugin's startup function */
pluginConvertFile convertFile; /* Pointer to plugin's file converter
* function */
+ pluginConvertFile convertVMFile; /* Pointer to plugin's VM file converter
+ function */
pluginConvertPage convertPage; /* Pointer to plugin's page converter
* function */
pluginShutdown shutdown; /* Pointer to plugin's shutdown function */
} pageCnvCtx;
const pageCnvCtx *setupPageConverter(void);
-#else
-/* dummy */
-typedef void *pageCnvCtx;
-#endif
+
+#define convertRelfile(pageConverter) \
+ ((pageConverter) && \
+ ((pageCnvCtx *)(pageConverter)->convertFile || \
+ (pageCnvCtx *)(pageConverter)->convertPage))
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst, bool force);
+ const char *dst, const char *type_suffix, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst);
+ const char *dst, const char *type_suffix, bool rewrite_vm);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/plugins/Makefile b/src/bin/pg_upgrade/plugins/Makefile
new file mode 100644
index 0000000..fb3f941
--- /dev/null
+++ b/src/bin/pg_upgrade/plugins/Makefile
@@ -0,0 +1,32 @@
+# src/bin/pg_upgrade/plugins/Makefile
+
+PGFILEDESC = "page conversion plugins for pg_upgrade"
+
+subdir = src/bin/pg_upgrade/plugins
+top_builddir = ../../../../
+include $(top_builddir)/src/Makefile.global
+
+#PG_CPPFLAGS=-I$(top_builddir)/src/bin/pg_upgrade
+override CPPFLAGS := -DDLSUFFIX=\"$(DLSUFFIX)\" -I$(srcdir) -I../ -I$(libpq_srcdir) $(CPPFLAGS)
+
+NAME = convertLayoutVM_add_frozenbit
+OBJS = convertLayoutVM_add_frozenbit.o
+plugindir = $(DESTDIR)$(libdir)/plugins
+
+all: all-shared-lib
+
+include $(top_srcdir)/src/Makefile.shlib
+
+install: all installdirs install-plugins
+
+installdirs:
+ $(MKDIR_P) '$(plugindir)'
+
+install-plugins:
+ $(INSTALL_SHLIB) $(NAME).so '$(plugindir)'
+
+uninstall:
+ rm -f '$(plugindir)/$(NAME).so'
+
+clean:
+ rm -f $(OBJS) $(NAME).so
\ No newline at end of file
diff --git a/src/bin/pg_upgrade/plugins/convertLayoutVM_add_frozenbit.c b/src/bin/pg_upgrade/plugins/convertLayoutVM_add_frozenbit.c
new file mode 100644
index 0000000..2245e30
--- /dev/null
+++ b/src/bin/pg_upgrade/plugins/convertLayoutVM_add_frozenbit.c
@@ -0,0 +1,159 @@
+/*
+ * convertLayoutVM_add_frozenbit.c
+ *
+ * Page converter plugin for Visibility Map
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/plugins/convertLayoutVM_add_frozenbit.c
+ */
+
+#include "postgres_fe.h"
+
+#include "access/visibilitymap.h"
+#include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
+#include "port.h"
+
+#include <fcntl.h>
+
+/* plugin function */
+const char* convertVMFile(void *pluginData,
+ const char *dstName, const char *srcName);
+
+static const int rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool checksum_enabled);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
+
+/*
+ * convertVMFile()
+ *
+ * This plugin function is loaded by the main procedure if required.
+ * pluginData indicates whether checksums are enabled on both
+ * clusters.  If the rewriting function fails, an error message is returned.
+ */
+const char *
+convertVMFile(void *pluginData, const char *dstName, const char *srcName)
+{
+ bool checksum_enabled;
+
+ checksum_enabled = *(bool *)pluginData;
+
+ if (rewriteVisibilitymap(srcName, dstName, checksum_enabled) == -1)
+ {
+#ifdef WIN32
+ _dosmaperr(GetLastError());
+#endif
+ return strdup(strerror(errno));
+ }
+
+ return NULL;
+}
+
+/*
+ * rewriteVisibilitymap()
+ *
+ * Copies a visibility map file while adding an all-frozen bit (initially 0) for each all-visible bit.
+ */
+static const int
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool checksum_enabled)
+{
+ int src_fd = 0;
+ int dst_fd = 0;
+ char buffer[BLCKSZ];
+ ssize_t bytesRead;
+ int rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+ BlockNumber blkno = 0;
+
+ /* Reset errno */
+ errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ goto err;
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ /* Perform data rewriting per page */
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
+ char *cur, *end, *blkend;
+ PageHeaderData pageheader;
+ uint16 vm_bits;
+
+ /* Save the page header data */
+ memcpy(&pageheader, buffer, SizeOfPageHeaderData);
+
+ cur = buffer + SizeOfPageHeaderData;
+ end = buffer + SizeOfPageHeaderData + rewriteVmBytesPerPage;
+ blkend = buffer + bytesRead;
+
+ while (blkend >= end)
+ {
+ char vmbuf[BLCKSZ];
+ char *vmtmp = vmbuf;
+
+ /* Copy page header in advance */
+ memcpy(vmbuf, &pageheader, SizeOfPageHeaderData);
+
+ vmtmp += SizeOfPageHeaderData;
+
+ /* Rewrite visibility map bit one by one */
+ while (end > cur)
+ {
+ /* Write rewritten bit from table and its string representation */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+ memcpy(vmtmp, &vm_bits, BITS_PER_HEAPBLOCK);
+
+ cur++;
+ vmtmp += BITS_PER_HEAPBLOCK;
+ }
+
+ /* Set new checksum for a visibility map page, If enabled */
+ if (checksum_enabled)
+ ((PageHeader) vmbuf)->pd_checksum = pg_checksum_page(vmbuf, blkno);
+
+ if (write(dst_fd, vmbuf, BLCKSZ) != BLCKSZ)
+ {
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ end += rewriteVmBytesPerPage;
+ blkno++;
+ }
+ }
+
+err:
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? 0 : -1;
+}
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c059c5b..c4bf77b 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *suffix);
/*
@@ -82,6 +82,10 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
{
int old_dbnum,
new_dbnum;
+ pageCnvCtx *pageConverter = NULL;
+
+ /* Set up page-converter and load necessary plugin */
+ pageConverter = (pageCnvCtx *) setupPageConverter();
/* Scan the old cluster databases and transfer their files */
for (old_dbnum = new_dbnum = 0;
@@ -92,7 +96,6 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
*new_db = NULL;
FileNameMap *mappings;
int n_maps;
- pageCnvCtx *pageConverter = NULL;
/*
* Advance past any databases that exist in the new cluster but not in
@@ -115,10 +118,6 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
if (n_maps)
{
print_maps(mappings, n_maps, new_db->db_name);
-
-#ifdef PAGE_CONVERSION
- pageConverter = setupPageConverter();
-#endif
transfer_single_new_db(pageConverter, mappings, n_maps,
old_tablespace);
}
@@ -144,15 +143,9 @@ get_pg_database_relfilenode(ClusterInfo *cluster)
int i_relfile;
res = executeQueryOrDie(conn,
- "SELECT c.relname, c.relfilenode "
- "FROM pg_catalog.pg_class c, "
- " pg_catalog.pg_namespace n "
- "WHERE c.relnamespace = n.oid AND "
- " n.nspname = 'pg_catalog' AND "
- " c.relname = 'pg_database' "
- "ORDER BY c.relname");
-
- i_relfile = PQfnumber(res, "relfilenode");
+ "SELECT pg_relation_filenode('pg_database') AS filenode");
+
+ i_relfile = PQfnumber(res, "filenode");
cluster->pg_database_oid = atooid(PQgetvalue(res, 0, i_relfile));
PQclear(res);
@@ -268,7 +261,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
/* Copying files might take some time, so give feedback. */
pg_log(PG_STATUS, "%s", old_file);
- if ((user_opts.transfer_mode == TRANSFER_MODE_LINK) && (pageConverter != NULL))
+ if ((user_opts.transfer_mode == TRANSFER_MODE_LINK) && (convertRelfile(pageConverter)))
pg_fatal("This upgrade requires page-by-page conversion, "
"you must use copy mode instead of link mode.\n");
@@ -276,7 +269,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, type_suffix, true)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -284,7 +277,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
+ if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file, type_suffix, true)) != NULL)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index ba79fb3..cd9b17e 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
002_enhance_visibilitymap_debug_messages_v34.patchtext/x-patch; charset=US-ASCII; name=002_enhance_visibilitymap_debug_messages_v34.patchDownload
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 6186caf..f4d878b 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -175,7 +175,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s, block %d", RelationGetRelationName(rel), heapBlk);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -274,7 +274,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
uint8 *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s, block %d, flags %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -364,7 +364,7 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_get_status %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_get_status %s, block %d", RelationGetRelationName(rel), heapBlk);
#endif
/* Reuse the old pinned buffer if possible */
@@ -467,7 +467,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ elog(DEBUG1, "vm_truncate %s, block %d", RelationGetRelationName(rel), nheapblocks);
#endif
RelationOpenSmgr(rel);
Hello,
At Thu, 4 Feb 2016 02:32:29 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoB1HnZ7thWYjqKve78gQ5+PyedbbkjAPbc5zLV3oA-CuA@mail.gmail.com>
On Tue, Feb 2, 2016 at 7:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Masahiko Sawada wrote:
I think we have two options.
1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
it and then converts only VM files.
2. Change pg_upgrade plugin mechanism so that it can handle other name
conversion plugins (e.g., convertLayout_vm_to_vfm)
I think #2 is better. Thought?
My vote is for #2 as well. Maybe we just didn't have forks when this
functionality was invented; maybe the author just didn't think hard
enough about what would be the right interface to do it.
I've almost written up a very rough patch. (it can pass the regression test)
Windows support is not done yet, and the Makefile is not correct.
I've divided the main patch into two patches: the add-frozen-bit patch and the
pg_upgrade support patch.
000 patch is almost the same as the previous code. (includes a small fix)
001 patch provides visibility map rewriting as a pageConverter routine.
002 patch enhances the debug messages in visibilitymap.c
Thanks, that makes it easier to read.
In order to support pageConvert plugin, I made the following changes.
* Main changes
- Remove PAGE_CONVERSION
- The pg_upgrade plugin is located in the 'src/bin/pg_upgrade/plugins' directory.
- Move the directory containing plugins from '$(bin)/plugins' to '$(lib)/plugins'.
These seem fair.
- Add new page-converter plugin function for visibility map.
- Current code doesn't allow us to use link mode (-k) in the case
where a page converter is required. But I changed it so that if a page converter for a fork file is
specified, we actually convert it even in link mode.
* Interface design
convertFile() and convertPage() are plugin functions for the main relation
file, and these functions are dynamically loaded by
loadConvertPlugin().
Though I haven't looked at this very closely, loadConverterPlugin looks
like it keeps deciding which plugin to load using the old and new page
layout versions. Currently the only actually possible version is 4,
and if we increment it now, 5.
On the other hand, _vm came at *catalog version* 201107031
(the 9.1 release) and _fsm came in the 8.4 release. Both of them use
page layout version 4. Are we allowed to increment the page layout
version for this reason? And is this framework under
reconstruction flexible enough for this kind of change in the
future? I don't think so.
We have added _vm and _fsm so far, so we need a version number
that can determine when _vm, _fsm and _vfm were introduced. I'm
afraid that is outside the page layout version's purpose; the catalog
version seems most usable, since it is already used to know when the
crash-safe VM was introduced.
Using the catalog version, the first plugin we provide would be
convertLayout_201105231_201602071.so, which has only a converter
from _vm to _vfm. This plugin is loaded for the combination of
a source cluster with catalog version 201105231 (when the VM
was introduced) or later and a destination cluster with a version
*before* 201602071 (this version).
If we change the format of fsm (and vm no longer exists), we would
have a new plugin, maybe named
convertLayout_200904091_2017xxxxx.so, which has, perhaps, an
in-place file converter for fsm. It would be loaded when the source
database is of catalog version 200904091 (when the FSM was
introduced) or later and the destination is before 2017xxxxx (that
version). The catalog version seems to work fine.
So far, I have assumed that the name of the
files to be converted is <oid>[fork_name], so the possible types
of conversions would be the following.
- per-page conversion
- per-file conversion between the files with the same fork name.
- per-file conversion between the files with different fork names.
Since the plugin filename doesn't convey such things, they should
be declared by the plugin itself. So a plugin would provide the
following interface:
typedef struct ConverterTable
{
char *src_fork_name;
char *dst_fork_name;
FileConverterFunc file_converter;
PageConverterFunc page_converter;
} ConverterTable[];
Following such a naming convention for plugins, we may load multiple
plugins at once, so we collect all entries of the tables of all
loaded plugins and check that no src_fork_name among them is
duplicated.
Here, we have sufficient information to choose which converter to
invoke and execute the conversion like this (a rough C sketch follows the pseudocode below).
for (fork_name in all_fork_names_including_"" )
{
find a converter comparing fork_name with src_fork_name.
check dst_fork_name and rename the target file if needed.
invoke the converter.
}
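To make the dispatch concrete, here is a minimal, untested C sketch of that
loop. ConverterEntry, find_converter(), the buffer sizes and the printf()
fallback are all illustrative assumptions of mine, not part of the posted
patch; the idea is only what the pseudocode above describes: look up a
converter by source fork name, rename the target when the destination fork
name differs, otherwise fall back to a plain copy or link.

#include <stdio.h>
#include <string.h>

/* Sketch only: these types and helpers are illustrative, not from the patch. */
typedef const char *(*FileConverterFunc) (const char *dst, const char *src);

typedef struct ConverterEntry
{
	const char *src_fork_name;	/* "" for the main fork, "_vm", "_fsm", ... */
	const char *dst_fork_name;	/* may differ, e.g. "_vm" becoming "_vfm" */
	FileConverterFunc file_converter;	/* NULL means no conversion needed */
} ConverterEntry;

static const ConverterEntry *
find_converter(const ConverterEntry *entries, int nentries, const char *fork)
{
	int			i;

	for (i = 0; i < nentries; i++)
		if (strcmp(entries[i].src_fork_name, fork) == 0)
			return &entries[i];
	return NULL;
}

static void
convert_all_forks(const ConverterEntry *entries, int nentries,
				  const char *old_path_base, const char *new_path_base)
{
	static const char *forks[] = {"", "_vm", "_fsm"};
	int			i;

	for (i = 0; i < (int) (sizeof(forks) / sizeof(forks[0])); i++)
	{
		const ConverterEntry *conv = find_converter(entries, nentries, forks[i]);
		char		src[1024];
		char		dst[1024];

		/* rename the target if the destination fork name differs */
		snprintf(src, sizeof(src), "%s%s", old_path_base, forks[i]);
		snprintf(dst, sizeof(dst), "%s%s", new_path_base,
				 conv ? conv->dst_fork_name : forks[i]);

		if (conv && conv->file_converter)
			conv->file_converter(dst, src);		/* rewrite while copying */
		else
			printf("plain copy or link: %s -> %s\n", src, dst);
	}
}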
If we need to convert clogs or similar files and need to prepare for
such events, the ConverterTable might have an additional member
and change the meaning of some of the existing members.
typedef struct ConverterTable
{
enum target_type; /* FILE_NAME or FORK_NAME */
char *src_name;
char *dst_name;
FileConverterFunc file_converter;
PageConverterFunc page_converter;
} ConverterTable[];
when target_type == FILE_NAME, src_name and dst_name represent
the target file names relative to $PGDATA.
# Yeah, I know it is too complicated.
I added a new pageConvert plugin function convertVMFile() for
visibility map (fork file).
If the layout of CLOG, FSM, etc. is changed in the future, we could
add new pageConvert plugin functions like convertCLOGFile() or
convertFSMFile(), and these functions are dynamically loaded by
loadAdditionalConvertPlugin().
It means that main file and fork file conversions are executed
independently, and the conversion for fork files is executed even if link
mode is specified.
Each conversion plugin is loaded and used only when it's required.
As I asked upthread, one of the most important design points of the
plugin mechanism is which characteristics of the source and/or destination
cluster trigger the loading of a plugin. And if that trigger is the page
layout format, are we allowed to increment it for such unrelated events? Or
should we use another characteristic, like the catalog version?
I still agree with this plugin approach, but I feel it's still
a bit complicated, and I'm concerned that the patch size has
increased.
Please give me feedback.
Yeah, I feel the same. What makes it worse, the plugin mechanism
will get even more complex if we make it more flexible for possible
future usage as I proposed above. It is apparently too complicated
when all it has to decide, for now, is whether to load *just one* converter
function. And no additional converter is in sight.
I'm inclined to pull all the plugin stuff out of pg_upgrade. We are
so prudent about changing file formats that this kind of event
will happen only at intervals of several years. The plugin mechanism
would be valuable if providing it encouraged us to change file formats
more frequently and freely, but such a situation would certainly
introduce more untoward things...
If there are no objections to this, I'm going to spend time
improving it.
Sorry, but I do have a strong objection to this... Does anyone else
have an opinion on that?
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Thank you for reviewing this patch!
On Wed, Feb 10, 2016 at 4:39 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello,
At Thu, 4 Feb 2016 02:32:29 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoB1HnZ7thWYjqKve78gQ5+PyedbbkjAPbc5zLV3oA-CuA@mail.gmail.com>
On Tue, Feb 2, 2016 at 7:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Masahiko Sawada wrote:
I think we have two options.
1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
it and then converts only VM files.
2. Change pg_upgrade plugin mechanism so that it can handle other name
conversion plugins (e.g., convertLayout_vm_to_vfm)
I think #2 is better. Thought?
My vote is for #2 as well. Maybe we just didn't have forks when this
functionality was invented; maybe the author just didn't think hard
enough about what would be the right interface to do it.
I've almost written up a very rough patch. (it can pass the regression test)
Windows support is not done yet, and the Makefile is not correct.
I've divided the main patch into two patches: the add-frozen-bit patch and the
pg_upgrade support patch.
000 patch is almost the same as the previous code. (includes a small fix)
001 patch provides visibility map rewriting as a pageConverter routine.
002 patch enhances the debug messages in visibilitymap.c
Thanks, that makes it easier to read.
In order to support pageConvert plugin, I made the following changes.
* Main changes
- Remove PAGE_CONVERSION
- The pg_upgrade plugin is located in the 'src/bin/pg_upgrade/plugins' directory.
- Move the directory containing plugins from '$(bin)/plugins' to '$(lib)/plugins'.
These seem fair.
- Add new page-converter plugin function for visibility map.
- Current code doesn't allow us to use link mode (-k) in the case
where a page converter is required. But I changed it so that if a page converter for a fork file is
specified, we actually convert it even in link mode.
* Interface design
convertFile() and convertPage() are plugin functions for the main relation
file, and these functions are dynamically loaded by
loadConvertPlugin().
Though I haven't looked at this very closely, loadConverterPlugin looks
like it keeps deciding which plugin to load using the old and new page
layout versions. Currently the only actually possible version is 4,
and if we increment it now, 5.
On the other hand, _vm came at *catalog version* 201107031
(the 9.1 release) and _fsm came in the 8.4 release. Both of them use
page layout version 4. Are we allowed to increment the page layout
version for this reason? And is this framework under
reconstruction flexible enough for this kind of change in the
future? I don't think so.
Yeah, I also think that the page layout version should not be increased by
this layout change of the vm.
This patch checks the catalog version first, and then decides which
plugin to load.
In this case, only the format of the VM has been changed, so pg_upgrade
loads a plugin for the VM and converts it; a minimal sketch of the bit
rewriting that conversion performs is below.
pg_upgrade doesn't load any other plugin, and other files are just copied.
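For what it's worth, the per-byte rewriting that the 001 patch's
rewrite_vm_table encodes boils down to spreading each old all-visible bit
into a two-bit slot whose all-frozen half starts out clear. A minimal sketch
(the helper name widen_vm_byte is mine, not from the patch):

#include <stdint.h>

/*
 * Sketch: turn one old visibility map byte (1 bit per heap block) into the
 * new 2-bits-per-heap-block form.  Old bit k (all-visible for heap block k)
 * moves to bit 2k of the result; the odd bits, which hold the new all-frozen
 * flags, stay clear after an upgrade.
 */
static uint16_t
widen_vm_byte(uint8_t old_byte)
{
	uint16_t	result = 0;
	int			k;

	for (k = 0; k < 8; k++)
		if (old_byte & (1 << k))
			result |= (uint16_t) (1 << (2 * k));

	return result;
}

/* widen_vm_byte(0xFF) == 21845 (0x5555), matching rewrite_vm_table[255] */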
We have added _vm and _fsm so far, so we need a version number
that can determine when _vm, _fsm and _vfm were introduced. I'm
afraid that is outside the page layout version's purpose; the catalog
version seems most usable, since it is already used to know when the
crash-safe VM was introduced.
Using the catalog version, the first plugin we provide would be
convertLayout_201105231_201602071.so, which has only a converter
from _vm to _vfm. This plugin is loaded for the combination of
a source cluster with catalog version 201105231 (when the VM
was introduced) or later and a destination cluster with a version
*before* 201602071 (this version).
If we change the format of fsm (and vm no longer exists), we would
have a new plugin, maybe named
convertLayout_200904091_2017xxxxx.so, which has, perhaps, an
in-place file converter for fsm. It would be loaded when the source
database is of catalog version 200904091 (when the FSM was
introduced) or later and the destination is before 2017xxxxx (that
version). The catalog version seems to work fine.
I think it's not a good idea to use the catalog version for the plugin name,
because even if the catalog version is used for the plugin file name as you
suggested, pg_upgrade still needs to decide by itself which plugin name to
load.
Also, a plugin file named after catalog versions would not make it easy to
understand what the plugin actually does. It's not developer friendly.
The advantage of using the page layout version in the plugin name is that
pg_upgrade can decide automatically which plugin should be loaded, as in the
sketch below.
So far, I have assumed that the name of the files to be converted is
<oid>[fork_name], so the possible types of conversion would be the
following.
- per-page conversion
- per-file conversion between files with the same fork name
- per-file conversion between files with different fork names
Since the plugin filename doesn't tell us such things, they should
be told by the plugin itself. So a plugin is to provide the
following interface,

typedef struct ConverterTable
{
	char *src_fork_name;
	char *dst_fork_name;
	FileConverterFunc file_converter;
	PageConverterFunc page_converter;
} ConverterTable[];

Following such a naming convention for plugins, we may load multiple
plugins at once, so we collect all the entries of the tables of all
loaded plugins and check that no src_fork_name among them is
duplicated.
Here, we have sufficient information to choose which converter to
invoke and execute the conversion like this:

for (fork_name in all_fork_names_including_"")
{
	find a converter comparing fork_name with src_fork_name.
	check dst_fork_name and rename the target file if needed.
	invoke the converter.
}

If we need to convert clogs or similar files and need to prepare for
such events, the ConverterTable might have an additional member
and change the meaning of some of the existing members.

typedef struct ConverterTable
{
	int   target_type;	/* FILE_NAME or FORK_NAME */
	char *src_name;
	char *dst_name;
	FileConverterFunc file_converter;
	PageConverterFunc page_converter;
} ConverterTable[];

When target_type == FILE_NAME, src_name and dst_name represent the
target file names relative to $PGDATA.
# Yeah, I know it is too complicated.
I agree with having a ConverterTable.
Since we have three kinds of file suffix types, "", "_vm" and "_fsm",
pg_upgrade will have three elements in ConverterTable[].
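To make that concrete, here is a minimal sketch of what such a
three-entry table could look like. This is only an illustration, not code
from the attached patches; the entry type, the prototypes of
pg_copy_file() and convertVMFile(), and their return type are assumptions
made here so the sketch is self-contained.

typedef const char *(*FileConverterFunc) (const char *src, const char *dst);
typedef const char *(*PageConverterFunc) (char *page, int blkno);

/* hypothetical prototypes, declared only for this illustration */
extern const char *pg_copy_file(const char *src, const char *dst);
extern const char *convertVMFile(const char *src, const char *dst);

typedef struct ConverterEntry
{
	const char *src_fork_name;		/* "" means the main fork */
	const char *dst_fork_name;		/* fork suffix after conversion */
	FileConverterFunc file_converter;	/* whole-file converter, or plain copy */
	PageConverterFunc page_converter;	/* per-page converter, if any */
} ConverterEntry;

static const ConverterEntry converter_table[] =
{
	{"",     "",     pg_copy_file,  NULL},	/* main fork: no conversion, copy */
	{"_vm",  "_vm",  convertVMFile, NULL},	/* rewrite the visibility map format */
	{"_fsm", "_fsm", pg_copy_file,  NULL},	/* fsm layout unchanged, copy as-is */
};

The point is just that the main fork and the fsm keep a plain copy as
their converter, while the _vm entry carries the actual rewrite.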
I added a new pageConvert plugin function convertVMFile() for the
visibility map (fork file).
If the layout of CLOG, FSM, etc. is changed in the future, we could add
new pageConvert plugin functions like convertCLOGFile() or
convertFSMFile(), and these functions would be dynamically loaded by
loadAdditionalConvertPlugin().
This means that conversion of the main file and of the other fork files
is executed independently, and conversion of fork files is executed even
if link mode is specified.
Each conversion plugin is loaded and used only when it's required.
As I asked upthread, one of the most important design points of the
plugin mechanism is what characteristics of the source and/or destination
cluster trigger the loading of a plugin. And if the page layout format
is that trigger, are we allowed to increment it for such unrelated
events? Or should we use another characteristic, like the catalog
version?
I still agree with this plugin approach, but I felt it's still a bit
complicated, and I'm concerned that the patch size has increased.
Please give me feedback.
Yeah, I feel the same. What makes it worse, the plugin mechanism
will get even more complex if we make it more flexible for the possible
usages I proposed above. It is apparently too complicated for
deciding whether to load *just one*, for now, converter
function. And no additional converter is in sight.
There will be cases where the layout of other types of relation files
changes, so pg_upgrade will need to convert several types of relation
files at the same time.
I'm thinking that we need to support loading multiple plugin functions,
at least.
I'm inclined to pull all the plugin stuff out of pg_upgrade. We are so
prudent about making changes to file formats that this kind of event
will happen at intervals of several years. The plugin mechanism would
be valuable if providing it encouraged us to change file formats more
frequently and freely, but such a situation would certainly introduce
more untoward things.
Yes, I think so too.
In fact, such a layout change is the first one since pg_upgrade was
introduced in 9.0.
Regards,
--
Masahiko Sawada
On Wed, Feb 3, 2016 at 12:32 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've divided the main patch into two patches; add frozen bit patch and
pg_upgrade support patch.
000 patch is almost same as previous code. (includes small fix)
001 patch provides rewriting visibility map as a pageConverter routine.
002 patch is for enhancement debug message in visibilitymap.c
I'd like to suggest splitting 000 into two patches. The first one
would change the format of the visibility map, and the second one
would change VACUUM to optimize scans based on the new format. I
think that would make it easier to get this reviewed and committed.
I think this patch churns a bunch of things that don't really need to
be churned. For example, consider this hunk:
/*
* If we didn't pin the visibility map page and the page has become all
- * visible while we were busy locking the buffer, we'll have to unlock and
- * re-lock, to avoid holding the buffer lock across an I/O. That's a bit
- * unfortunate, but hopefully shouldn't happen often.
+ * visible or all frozen while we were busy locking the buffer, we'll
+ * have to unlock and re-lock, to avoid holding the buffer lock across an
+ * I/O. That's a bit unfortunate, but hopefully shouldn't happen often.
*/
Since the page can't become all-frozen without also becoming
all-visible, the original text is still 100% accurate, and the change
doesn't seem to add any useful clarity. Let's think about which
things really need to be changed and not just mechanically change
everything.
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ Assert(((flags | VISIBILITYMAP_ALL_VISIBLE) &&
PageIsAllVisible(heapPage)) ||
+ ((flags | VISIBILITYMAP_ALL_FROZEN) &&
PageIsAllFrozen(heapPage)));
I think this would be more clear as two separate assertions.
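For illustration, the two-assertion form could look something like the
following sketch (not the patch's final wording; it also tests the flags
with & where the quoted hunk uses |, which would otherwise make the flag
part of the test always true):

	if (flags & VISIBILITYMAP_ALL_VISIBLE)
		Assert(PageIsAllVisible(heapPage));
	if (flags & VISIBILITYMAP_ALL_FROZEN)
		Assert(PageIsAllFrozen(heapPage));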
Your 000 patch has a little bit of whitespace damage:
[rhaas pgsql]$ git diff --check
src/backend/commands/vacuumlazy.c:1951: indent with spaces.
+ bool *all_visible, bool
*all_frozen)
src/include/access/heapam_xlog.h:393: indent with spaces.
+ Buffer vm_buffer, TransactionId
cutoff_xid, uint8 flags);
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Feb 12, 2016 at 4:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Feb 3, 2016 at 12:32 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've divided the main patch into two patches; add frozen bit patch and
pg_upgrade support patch.
000 patch is almost same as previous code. (includes small fix)
001 patch provides rewriting visibility map as a pageConverter routine.
002 patch is for enhancement debug message in visibilitymap.c
I'd like to suggest splitting 000 into two patches. The first one
would change the format of the visibility map, and the second one
would change VACUUM to optimize scans based on the new format. I
think that would make it easier to get this reviewed and committed.
Thank you for reviewing this patch.
I've divided the 000 patch into two patches, and attached the latest 4
patches in total.
I changed the pg_upgrade plugin logic so that every kind of type suffix
has one convert plugin.
A type suffix which doesn't need to be converted gets the pg_copy_file()
function as its plugin function.
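As a rough sketch of that dispatch, reusing the hypothetical
converter_table and pg_copy_file() declarations from the sketch upthread
(convert_or_copy() and its signature are assumptions, not functions from
the patches):

#include <string.h>

static const char *
convert_or_copy(const char *fork_suffix, const char *src, const char *dst)
{
	int			i;

	/* look up the converter registered for this fork suffix */
	for (i = 0; i < (int) (sizeof(converter_table) / sizeof(converter_table[0])); i++)
	{
		if (strcmp(converter_table[i].src_fork_name, fork_suffix) == 0)
			return converter_table[i].file_converter(src, dst);
	}

	/* no entry registered for this suffix: fall back to a plain copy */
	return pg_copy_file(src, dst);
}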
Regards,
--
Masahiko Sawada
Attachments:
000_add_frozen_bit_into_visibilitymap_v35.patchapplication/octet-stream; name=000_add_frozen_bit_into_visibilitymap_v35.patchDownload
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 001988b..5d08c73 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..5dc8b04 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -623,18 +623,20 @@ can be used to examine the information stored in free space maps.
<para>
Each heap relation has a Visibility Map
(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
+visible to all active transactions, and pages contain only unfrozen tuples.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_vm</> suffix.
+For example, if the filenode of a relation is 12345, the VM is stored in a file
+called <filename>12345_vm</>, in the same directory as the main relation file.
Note that indexes do not have VMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The visibility map stores two bits per heap page: all-visible, all-frozen.
+A set all-visible bit means that all tuples on the page are known to be visible
+to all transactions. A set all-frozen bit means that all tuples on the page are
+completely marked as frozen. This means that the page does not contain any tuples
+that need to be vacuumed, even if a scan of whole table is required.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f443742..5835e54 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7205,7 +7205,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -7215,6 +7215,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7804,7 +7805,12 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7856,7 +7862,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index fc28f3f..e269a5d 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -15,39 +15,45 @@
* visibilitymap_pin - pin a map page for setting a bit
* visibilitymap_pin_ok - check whether correct map page is already pinned
* visibilitymap_set - set a bit in a previously pinned page
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
* visibilitymap_count - count number of bits set in visibility map
* visibilitymap_truncate - truncate the visibility map
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if whole table scanning vacuum is required (e.g. anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible.
*
- * Clearing a visibility map bit is not separately WAL-logged. The callers
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
+ *
+ * Clearing both visibility map bits is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
* replay of the updating operation as well.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is all-visible or
+ * all-frozen we just don't know that for certain. The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility map
+ * bit. If a crash occurs after the visibility map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
+ * The visibility map has the all-frozen bit which indicates all tuples on
+ * corresponding page have been completely frozen, so the visibility map is also
+ * used for anti-wraparound vacuum, even if freezing of tuples is required.
*
* LOCKING
*
@@ -101,38 +107,50 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
-
-/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
-
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
/* Mapping from heap block number to the right bit in the visibility map */
#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
-#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
+
+/* tables for fast counting of set bits for visible and frozen */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +159,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,7 +171,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = VISIBILITYMAP_VALID_BITS << mapBit;
char *map;
#ifdef TRACE_VISIBILITYMAP
@@ -186,7 +204,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
* visibilitymap_set to actually set the bit.
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk.
*
@@ -212,7 +230,7 @@ visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
* visibilitymap_pin_ok - do we already have the correct page pinned?
*
* On entry, buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. The return value indicates whether the buffer covers the
* given heapBlk.
*/
@@ -225,7 +243,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +252,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and flags which indicates what flag we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,13 +264,14 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
Page page;
- char *map;
+ uint8 *map;
#ifdef TRACE_VISIBILITYMAP
elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -259,6 +279,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
Assert(InRecovery || BufferIsValid(heapBuf));
+ Assert(flags & VISIBILITYMAP_VALID_BITS);
/* Check that we have the right heap page pinned, if present */
if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -269,14 +290,14 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
elog(ERROR, "wrong VM buffer passed to visibilitymap_set");
page = BufferGetPage(vmBuf);
- map = PageGetContents(page);
+ map = (uint8 *)PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] >> mapBit & VISIBILITYMAP_VALID_BITS))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << mapBit);
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +306,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +316,19 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ if (flags & VISIBILITYMAP_ALL_VISIBLE)
+ Assert(PageIsAllVisible(heapPage));
+ if (flags & VISIBILITYMAP_ALL_FROZEN)
+ Assert(PageIsAllFrozen(heapPage));
+
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +339,17 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible to all or are marked frozen, according
+ * to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
- * earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits, and must pass flags
+ * for which it needs to check the value in visibility map.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -327,17 +358,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* we might see the old value. It is the caller's responsibility to deal with
* all concurrency issues!
*/
-bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+uint8
+visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- bool result;
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_get_status %s %d", RelationGetRelationName(rel), heapBlk);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,13 +390,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * The double bits read is atomic. There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
-
- return result;
+ return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
}
/*
@@ -374,14 +402,20 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
*/
-BlockNumber
-visibilitymap_count(Relation rel)
+void
+visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)
{
- BlockNumber result = 0;
BlockNumber mapBlock;
+ /* all_visible must be specified */
+ Assert(all_visible);
+
+ *all_visible = 0;
+ if (all_frozen)
+ *all_frozen = 0;
+
for (mapBlock = 0;; mapBlock++)
{
Buffer mapBuffer;
@@ -406,13 +440,13 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ *all_visible += number_of_ones_for_visible[map[i]];
+ if (all_frozen)
+ *all_frozen += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
}
-
- return result;
}
/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index a309c44..3737b10 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1920,7 +1920,7 @@ index_update_stats(Relation rel,
BlockNumber relallvisible;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ visibilitymap_count(rel, &relallvisible, NULL);
else /* don't bother for indexes */
relallvisible = 0;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 923fe58..86437c6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -452,6 +452,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(C.oid) AS n_frozen_pages,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 070df29..d7f3035 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -314,6 +314,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ BlockNumber relallvisible,
+ relallfrozen;
if (inh)
ereport(elevel,
@@ -564,51 +566,56 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
}
}
- /*
- * Update pages/tuples stats in pg_class ... but not if we're doing
- * inherited stats.
- */
if (!inh)
+ {
+ /* Calculate the number of all-visible and all-frozen bit */
+ visibilitymap_count(onerel, &relallvisible, &relallfrozen);
+
+ /*
+ * Update pages/tuples stats in pg_class ... but not if we're doing
+ * inherited stats.
+ */
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ relallvisible,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
in_outer_xact);
- /*
- * Same for indexes. Vacuum always scans all indexes, so if we're part of
- * VACUUM ANALYZE, don't overwrite the accurate count already inserted by
- * VACUUM.
- */
- if (!inh && !(options & VACOPT_VACUUM))
- {
- for (ind = 0; ind < nindexes; ind++)
+ /*
+ * Same for indexes. Vacuum always scans all indexes, so if we're part of
+ * VACUUM ANALYZE, don't overwrite the accurate count already inserted by
+ * VACUUM.
+ */
+ if (!(options & VACOPT_VACUUM))
{
- AnlIndexData *thisdata = &indexdata[ind];
- double totalindexrows;
-
- totalindexrows = ceil(thisdata->tupleFract * totalrows);
- vac_update_relstats(Irel[ind],
- RelationGetNumberOfBlocks(Irel[ind]),
- totalindexrows,
- 0,
- false,
- InvalidTransactionId,
- InvalidMultiXactId,
- in_outer_xact);
+ for (ind = 0; ind < nindexes; ind++)
+ {
+ AnlIndexData *thisdata = &indexdata[ind];
+ double totalindexrows;
+
+ totalindexrows = ceil(thisdata->tupleFract * totalrows);
+ vac_update_relstats(Irel[ind],
+ RelationGetNumberOfBlocks(Irel[ind]),
+ totalindexrows,
+ 0,
+ false,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ in_outer_xact);
+ }
}
- }
- /*
- * Report ANALYZE to the stats collector, too. However, if doing
- * inherited stats we shouldn't report, because the stats collector only
- * tracks per-table stats.
- */
- if (!inh)
- pgstat_report_analyze(onerel, totalrows, totaldeadrows);
+ /*
+ * Report ANALYZE to the stats collector, too. However, if doing
+ * inherited stats we shouldn't report, because the stats collector only
+ * tracks per-table stats.
+ */
+ pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
+
+ }
/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
if (!(options & VACOPT_VACUUM))
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 90afbdc..4f6f91c 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -85,9 +85,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
* which all tuples are known visible to everybody. In any case,
* we'll use the index tuple not the heap tuple as the data source.
*
- * Note on Memory Ordering Effects: visibilitymap_test does not lock
- * the visibility map buffer, and therefore the result we read here
- * could be slightly stale. However, it can't be stale enough to
+ * Note on Memory Ordering Effects: visibilitymap_get_status does not
+ * lock the visibility map buffer, and therefore the result we read
+ * here could be slightly stale. However, it can't be stale enough to
* matter.
*
* We need to detect clearing a VM bit due to an insert right away,
@@ -114,9 +114,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
* It's worth going through this complexity to avoid needing to lock
* the VM buffer, which could cause significant contention.
*/
- if (!visibilitymap_test(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ if (!VM_ALL_VISIBLE(scandesc->heapRelation,
+ ItemPointerGetBlockNumber(tid),
+ &node->ioss_VMBuffer))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index da768c6..08b61cb 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1329,7 +1329,8 @@ pgstat_report_autovac(Oid dboid)
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgVacuum msg;
@@ -1343,6 +1344,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -1354,7 +1356,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
*/
void
pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgAnalyze msg;
@@ -1394,6 +1397,7 @@ pgstat_report_analyze(Relation rel,
msg.m_analyzetime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -3702,6 +3706,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
+ result->n_frozen_pages = 0;
result->blocks_fetched = 0;
result->blocks_hit = 0;
result->vacuum_timestamp = 0;
@@ -5069,6 +5074,7 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
if (msg->m_autovacuum)
{
@@ -5103,6 +5109,7 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
/*
* We reset changes_since_analyze to zero, forgetting any changes that
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 1b22fcc..7c57b3e 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -46,6 +46,7 @@ extern Datum pg_stat_get_vacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autovacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_analyze_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_frozen_pages(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_calls(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_total_time(PG_FUNCTION_ARGS);
@@ -450,6 +451,21 @@ pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_frozen_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int32 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int32) (tabentry->n_frozen_pages);
+
+ PG_RETURN_INT32(result);
+}
+
+Datum
pg_stat_get_function_calls(PG_FUNCTION_ARGS)
{
Oid funcid = PG_GETARG_OID(0);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index f77489b..583b55a 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index d447daf..a75de5c 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,36 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, other for all-frozen.
+*/
+#define BITS_PER_HEAPBLOCK 2
+
+/* Number of heap blocks we can represent in one byte. */
+#define HEAPBLOCKS_PER_BYTE 4
+
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+#define VISIBILITYMAP_VALID_BITS 0x03 /* OR of all valid visiblitymap flags bits */
+
+/* Macros for visibilitymap test */
+#define VM_ALL_VISIBLE(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_VISIBLE) != 0)
+#define VM_ALL_FROZEN(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 378c40f..a0b210b 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201602071
+#define CATALOG_VERSION_NO 201602131
#endif
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 1c0ef9a..a0420b5 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2706,6 +2706,8 @@ DATA(insert OID = 3056 ( pg_stat_get_analyze_count PGNSP PGUID 12 1 0 0 0 f f f
DESCR("statistics: number of manual analyzes for a table");
DATA(insert OID = 3057 ( pg_stat_get_autoanalyze_count PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_autoanalyze_count _null_ _null_ _null_ ));
DESCR("statistics: number of auto analyzes for a table");
+DATA(insert OID = 6015 ( pg_stat_get_frozen_pages PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_frozen_pages _null_ _null_ _null_ ));
+DESCR("statistics: number of frozen pages of table");
DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 0 f f f f t t s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_idset _null_ _null_ _null_ ));
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f f f t s r 1 0 2249 "23" "{23,26,23,26,25,25,25,16,1184,1184,1184,1184,869,25,23,28,28,16,25,25,23,16,25}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,pid,usesysid,application_name,state,query,waiting,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,ssl,sslversion,sslcipher,sslbits,sslcompression,sslclientdn}" _null_ _null_ pg_stat_get_activity _null_ _null_ _null_ ));
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 65e968e..ad40b70 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -17,6 +17,7 @@
#include "portability/instr_time.h"
#include "postmaster/pgarch.h"
#include "storage/barrier.h"
+#include "storage/block.h"
#include "utils/hsearch.h"
#include "utils/relcache.h"
@@ -355,6 +356,7 @@ typedef struct PgStat_MsgVacuum
TimestampTz m_vacuumtime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ BlockNumber m_frozen_pages;
} PgStat_MsgVacuum;
@@ -372,6 +374,7 @@ typedef struct PgStat_MsgAnalyze
TimestampTz m_analyzetime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgAnalyze;
@@ -551,7 +554,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9E
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -615,6 +618,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter n_dead_tuples;
PgStat_Counter changes_since_analyze;
+ int32 n_frozen_pages;
+
PgStat_Counter blocks_fetched;
PgStat_Counter blocks_hit;
@@ -917,9 +922,11 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type t
extern void pgstat_report_autovac(Oid dboid);
extern void pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_recovery_conflict(int reason);
extern void pgstat_report_deadlock(void);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 2ce3be7..0b023b3 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -367,7 +369,12 @@ typedef PageHeaderData *PageHeader;
#define PageSetAllVisible(page) \
(((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
#define PageClearAllVisible(page) \
- (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+ (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
+
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
#define PageIsPrunable(page, oldestxmin) \
( \
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2bdba2d..0f13ab0 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1683,6 +1683,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(c.oid) AS n_frozen_pages,
pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid) AS last_autovacuum,
pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
@@ -1795,6 +1796,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_pages,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
@@ -1838,6 +1840,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_pages,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index f5be70f..95ababf 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -127,6 +127,8 @@ SELECT count(*) FROM tenk2 WHERE unique1 = 1;
1
(1 row)
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
SELECT pg_sleep(1.0);
@@ -175,6 +177,14 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
t | t
(1 row)
+SELECT n_frozen_pages = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+ ?column?
+----------
+ t
+(1 row)
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
snapshot_newer
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index cd2d592..dea5553 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -120,6 +120,8 @@ ROLLBACK;
SELECT count(*) FROM tenk2;
-- do an indexscan
SELECT count(*) FROM tenk2 WHERE unique1 = 1;
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
@@ -145,6 +147,10 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
WHERE st.relname='tenk2' AND cl.relname='tenk2';
+SELECT n_frozen_pages = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
001_optimize_vacuum_scan_based_on_freezemap_v35.patchapplication/octet-stream; name=001_optimize_vacuum_scan_based_on_freezemap_v35.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index de84b77..dc39e94 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5961,7 +5961,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs an eager freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -6005,7 +6005,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs an eager freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..7cc975d 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and pages contain only frozen tuples.
+ This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -438,28 +438,25 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows. To ensure all old
+ row versions have been frozen, a scan of all unfrozen pages is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: a whole-table freezing is forced if
+ the table hasn't been ensured all row versions are frozen for
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transactions.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
- that, data loss could result. To ensure that this does not happen,
- autovacuum is invoked on any table that might contain unfrozen rows with
- XIDs older than the age specified by the configuration parameter <xref
- linkend="guc-autovacuum-freeze-max-age">. (This will happen even if
- autovacuum is disabled.)
+ the time <command>VACUUM</> last scanned unfrozen pages. If it were to go
+ unvacuumed for longer than that, data loss could result. To ensure
+ that this does not happen, autovacuum is invoked on any table that might
+ contain unfrozen rows with XIDs older than the age specified by the
+ configuration parameter <xref linkend="guc-autovacuum-freeze-max-age">.
+ (This will happen even if autovacuum is disabled.)
</para>
<para>
@@ -490,8 +487,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +523,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +551,18 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. Freezing occurs on the whole table once all pages of this relation
+ require it. In other cases such as where <structfield>relfrozenxid</> is more
+ than <varname>vacuum_freeze_table_age</> transactions old, where
+ <command>VACUUM</>'s <literal>FREEZE</> option is used, <command>VACUUM</>
+ can skip the pages that all tuples on the page itself are marked as frozen.
+ When all pages of table are eventually marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transactions started since the <command>VACUUM</> started).
+ If the advancing of <structfield>relfrozenxid</> has not happened until
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,28 +639,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ When <command>VACUUM</> scans all unfrozen pages, regardless of what causes
+ them, enable advancing the value for that table.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. A
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +740,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 85459d0..0bcd52d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1423,6 +1423,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Estimated number of rows modified since this table was last analyzed</entry>
</row>
<row>
+ <entry><structfield>n_frozen_pages</></entry>
+ <entry><type>integer</></entry>
+ <entry>Number of frozen pages</entry>
+ </row>
+ <row>
<entry><structfield>last_vacuum</></entry>
<entry><type>timestamp with time zone</></entry>
<entry>Last time at which this table was manually vacuumed
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 4f6f6e7..2a174a2 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped by the all-frozen bit of the visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,8 +158,9 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
ItemPointer itemptr);
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
-static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+static void heap_page_visible_status(Relation rel, Buffer buf,
+ TransactionId *visibility_cutoff_xid,
+ bool *all_visible, bool *all_frozen);
/*
@@ -188,7 +191,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -221,7 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* We request a full scan if either the table's frozen Xid is now older
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit.
+ * mxid full-table scan limit. Even during a full scan, pages whose
+ * all-frozen bit is set in the visibility map can be skipped.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +258,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -274,15 +280,15 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* Update statistics in pg_class.
*
* A corner case here is that if we scanned no pages at all because every
- * page is all-visible, we should not update relpages/reltuples, because
- * we have no new information to contribute. In particular this keeps us
- * from replacing relpages=reltuples=0 (which means "unknown tuple
+ * page is all-visible or all-frozen, we should not update relpages/reltuples,
+ * because we have no new information to contribute. In particular this keeps
+ * us from replacing relpages=reltuples=0 (which means "unknown tuple
* density") with nonzero relpages and reltuples=0 (which means "zero
* tuple density") unless there's some actual evidence for the latter.
*
- * We do update relallvisible even in the corner case, since if the table
- * is all-visible we'd definitely like to know that. But clamp the value
- * to be not more than what we're setting relpages to.
+ * We do update relallvisible and relallfrozen even in the corner case,
+ * since if the table is all-visible we'd definitely like to know that.
+ * But clamp the value to be not more than what we're setting relpages to.
*
* Also, don't change relfrozenxid/relminmxid if we skipped any pages,
* since then we don't know for certain that all tuples have a newer xmin.
@@ -295,10 +301,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ visibilitymap_count(onerel, &new_rel_allvisible, &new_rel_allfrozen);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -319,7 +328,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),
onerel->rd_rel->relisshared,
new_live_tuples,
- vacrelstats->new_dead_tuples);
+ vacrelstats->new_dead_tuples,
+ new_rel_allfrozen);
/* and log the action if appropriate */
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
@@ -354,10 +364,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -480,9 +491,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page because of the all-visible bit of the
+ * visibility map means that we might not be able to update relfrozenxid,
+ * so we only want to do it if we can skip a goodly number. On the other
+ * hand, we count both how many pages we skipped because of the all-frozen
+ * bit and how many pages we froze, so we can still update relfrozenxid
+ * when the sum of the two equals the number of pages in the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -492,18 +506,18 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we might not actually skip any pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
- * Note: The value returned by visibilitymap_test could be slightly
+ * Note: The value returned by visibilitymap_get_status could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*
* We will scan the table's last page, at least to the extent of
* determining whether it has tuples or not, even if it should be skipped
@@ -518,7 +532,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -536,9 +550,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples we freeze on this page */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page; /* # of tuples remaining on this page */
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -554,8 +572,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -569,14 +586,29 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
skipping_all_visible_blocks = true;
else
skipping_all_visible_blocks = false;
+
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all && !FORCE_CHECK_PAGE())
- continue;
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * Check whether it is also all-frozen, in which case it can be skipped
+ * even when scan_all is true.
+ */
+ bool all_frozen = VM_ALL_FROZEN(onerel, blkno, &vmbuffer);
+
+ if (scan_all && all_frozen && !FORCE_CHECK_PAGE())
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ else if (!scan_all && skipping_all_visible_blocks && !FORCE_CHECK_PAGE())
+ continue;
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -743,7 +775,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
empty_pages++;
freespace = PageGetHeapFreeSpace(page);
- /* empty pages are always all-visible */
+ /* empty pages are always all-visible and all-frozen */
if (!PageIsAllVisible(page))
{
START_CRIT_SECTION();
@@ -766,8 +798,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
log_newpage_buffer(buf, true);
PageSetAllVisible(page);
+ PageSetAllFrozen(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
END_CRIT_SECTION();
}
@@ -791,13 +825,15 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
/*
* Note: If you change anything in the loop below, also look at
- * heap_page_is_all_visible to see if that needs to be changed.
+ * heap_page_visible_status to see if that needs to be changed.
*/
for (offnum = FirstOffsetNumber;
offnum <= maxoff;
@@ -945,8 +981,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -993,6 +1034,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute total number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -1015,26 +1059,47 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* This page is all visible */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
+
}
/*
@@ -1045,9 +1110,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
{
- elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+ /* If the all-frozen is set then all-visible must be set */
+ if (all_frozen_according_to_vm)
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page is not marked all-visible (and all-frozen) but visibility map bit(s) is set in relation \"%s\" page %u",
relname, blkno);
visibilitymap_clear(onerel, blkno, vmbuffer);
}
@@ -1055,19 +1125,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/*
* It's possible for the value returned by GetOldestXmin() to move
* backwards, so it's not wrong for us to see tuples that appear to
- * not be visible to everyone yet, while PD_ALL_VISIBLE is already
- * set. The real safe xmin value never moves backwards, but
- * GetOldestXmin() is conservative and sometimes returns a value
+ * not be visible to everyone yet, while PD_ALL_VISIBLE (and PD_ALL_FROZEN)
+ * are already set. The real safe xmin value never moves backwards,
+ * but GetOldestXmin() is conservative and sometimes returns a value
* that's unnecessarily small, so if we see that contradiction it just
* means that the tuples that we think are not visible to everyone yet
- * actually are, and the PD_ALL_VISIBLE flag is correct.
+ * actually are, and the PD_ALL_VISIBLE (and PD_ALL_FROZEN) flags are
+ * correct.
*
* There should never be dead tuples on a page with PD_ALL_VISIBLE
* set, however.
*/
else if (PageIsAllVisible(page) && has_dead_tuples)
{
- elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+ /* If all-frozen is set then all-visible must be set */
+ if (PageIsAllFrozen(page))
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page containing dead tuples is marked as all-visible (and all-frozen) in relation \"%s\" page %u",
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
@@ -1141,6 +1217,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to visibility map",
+ "skipped %d frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages,
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1257,6 +1340,8 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_visible;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1308,19 +1393,36 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ heap_page_visible_status(onerel, buffer, &visibility_cutoff_xid,
+ &all_visible, &all_frozen);
+ if (all_visible)
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set the VM all-frozen bit and flag.
*/
- if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
+ uint8 vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer);
+ uint8 flags = 0;
+
+ if (!(vm_status & VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* Set the VM all-frozen bit to flag, if needed */
+ if (all_frozen && !(vm_status & VISIBILITYMAP_ALL_FROZEN))
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
Assert(BufferIsValid(*vmbuffer));
- visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+
+ if (vm_status != flags)
+ visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1842,18 +1944,21 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and all_frozen which indicates whether
+ * all tuples on this page are frozen.
*/
-static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+static void
+heap_page_visible_status(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_visible, bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
OffsetNumber offnum,
maxoff;
- bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_visible = true;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1861,7 +1966,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
*/
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = FirstOffsetNumber;
- offnum <= maxoff && all_visible;
+ offnum <= maxoff && *all_visible;
offnum = OffsetNumberNext(offnum))
{
ItemId itemid;
@@ -1877,11 +1982,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
- all_visible = false;
+ *all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1900,7 +2006,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Check comments in lazy_scan_heap. */
if (!HeapTupleHeaderXminCommitted(tuple.t_data))
{
- all_visible = false;
+ *all_visible = false;
break;
}
@@ -1911,13 +2017,17 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
xmin = HeapTupleHeaderGetXmin(tuple.t_data);
if (!TransactionIdPrecedes(xmin, OldestXmin))
{
- all_visible = false;
+ *all_visible = false;
break;
}
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1925,7 +2035,8 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_RECENTLY_DEAD:
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
- all_visible = false;
+ *all_visible = false;
+ *all_frozen = false;
break;
default:
@@ -1934,5 +2045,6 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
} /* scan along page */
- return all_visible;
+ if (!(*all_visible))
+ *all_frozen = false;
}
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..87206b6
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,22 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+ ?column?
+----------
+ t
+(1 row)
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 44 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 56 nonremovable row versions in 1 out of 45 pages
+\set VERBOSITY default
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index bec0316..2324420 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -110,3 +110,6 @@ test: event_trigger
# run stats by itself because its delay may be insufficient under heavy load
test: stats
+
+# the visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 7e9b319..df4c717 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -162,3 +162,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..365570b
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,16 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+
+-- All pages become all-visible
+VACUUM FREEZE vmtest;
+SELECT relallvisible = (pg_relation_size('vmtest') / current_setting('block_size')::int) FROM pg_class WHERE relname = 'vmtest';
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+\set VERBOSITY default
+
+DROP TABLE vmtest;
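
To make the new skip rule concrete, here is a minimal stand-alone C sketch of
how the two visibility map bits could drive the decision to skip a heap block.
The flag values and the helper are illustrative only (the patch defines the
real VISIBILITYMAP_ALL_VISIBLE / VISIBILITYMAP_ALL_FROZEN flags), and the real
lazy_scan_heap additionally applies the FORCE_CHECK_PAGE and
skip-a-goodly-number heuristics that are omitted here.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative flag values; the patch defines the real ones elsewhere. */
#define SKETCH_ALL_VISIBLE 0x01
#define SKETCH_ALL_FROZEN  0x02

/* May vacuum skip this heap block, given its visibility map bits? */
static bool
can_skip_block(uint8_t vm_status, bool scan_all)
{
    bool all_visible = (vm_status & SKETCH_ALL_VISIBLE) != 0;
    bool all_frozen = (vm_status & SKETCH_ALL_FROZEN) != 0;

    /* An all-frozen block is expected to be all-visible as well. */
    if (all_frozen && !all_visible)
        return false;       /* inconsistent map: visit the block */

    if (scan_all)
        return all_frozen;  /* anti-wraparound scans skip only all-frozen blocks */
    return all_visible;     /* ordinary vacuums may skip all-visible blocks */
}

int main(void)
{
    printf("%d\n", can_skip_block(SKETCH_ALL_VISIBLE, true));                     /* 0 */
    printf("%d\n", can_skip_block(SKETCH_ALL_VISIBLE | SKETCH_ALL_FROZEN, true)); /* 1 */
    return 0;
}

With this reading, an anti-wraparound vacuum no longer has to read blocks that
are already marked all-frozen, which is the point of the patch.
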
002_freezemap_support_for_pg_upgrade_v35.patchapplication/octet-stream; name=002_freezemap_support_for_pg_upgrade_v35.patchDownload
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index d9c8145..153622d 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -11,8 +11,11 @@ OBJS = check.o controldata.o dump.o exec.o file.o function.o info.o \
option.o page.o parallel.o pg_upgrade.o relfilenode.o server.o \
tablespace.o util.o version.o $(WIN32RES)
+SUBDIRS = plugins
+
override CPPFLAGS := -DDLSUFFIX=\"$(DLSUFFIX)\" -I$(srcdir) -I$(libpq_srcdir) $(CPPFLAGS)
+$(recurse)
all: pg_upgrade
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 9357ad8..648013e 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -14,75 +14,63 @@
#include <fcntl.h>
-
#ifndef WIN32
static int copy_file(const char *fromfile, const char *tofile, bool force);
#else
static int win32_pghardlink(const char *src, const char *dst);
#endif
-
/*
* copyAndUpdateFile()
*
- * Copies a relation file from src to dst. If pageConverter is non-NULL, this function
- * uses that pageConverter to do a page-by-page conversion.
+ * Copies a relation file from src to dst. This function uses the given
+ * pageConverter to do a page-by-page or file-by-file conversion.
*/
const char *
copyAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst, bool force)
+ const char *src, const char *dst, const char *type_suffix)
{
- if (pageConverter == NULL)
+ int i;
+ convertPlugin *plugin;
+
+ /* Find the appropriate plugin function in the converter table */
+ for (i = 0; i < LOAD_PLUGIN_NUM; i++)
{
-#ifndef WIN32
- if (copy_file(src, dst, force) == -1)
-#else
- if (CopyFile(src, dst, !force) == 0)
-#endif
- return getErrorText();
- else
- return NULL;
+ plugin = &(pageConverter->converterTable[i]);
+
+ if (strcmp(plugin->src_type_suffix, type_suffix) == 0)
+ break;
}
+
+ /* If this plugin has convertFile function, invoke it */
+ if (plugin->convertFile)
+ return plugin->convertFile(plugin->pluginData, dst, src);
else
{
- /*
- * We have a pageConverter object - that implies that the
- * PageLayoutVersion differs between the two clusters so we have to
- * perform a page-by-page conversion.
- *
- * If the pageConverter can convert the entire file at once, invoke
- * that plugin function, otherwise, read each page in the relation
- * file and call the convertPage plugin function.
- */
-
-#ifdef PAGE_CONVERSION
- if (pageConverter->convertFile)
- return pageConverter->convertFile(pageConverter->pluginData,
- dst, src);
- else
-#endif
+ /* We perform a page-by-page conversion */
+
+ int src_fd;
+ int dstfd;
+ char buf[BLCKSZ];
+ ssize_t bytesRead;
+ const char *msg = NULL;
+
+ if ((src_fd = open(src, O_RDONLY, 0)) < 0)
+ return "could not open source file";
+
+ if ((dstfd = open(dst, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR)) < 0)
{
- int src_fd;
- int dstfd;
- char buf[BLCKSZ];
- ssize_t bytesRead;
- const char *msg = NULL;
-
- if ((src_fd = open(src, O_RDONLY, 0)) < 0)
- return "could not open source file";
-
- if ((dstfd = open(dst, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR)) < 0)
- {
- close(src_fd);
- return "could not create destination file";
- }
+ close(src_fd);
+ return "could not create destination file";
+ }
- while ((bytesRead = read(src_fd, buf, BLCKSZ)) == BLCKSZ)
- {
-#ifdef PAGE_CONVERSION
- if ((msg = pageConverter->convertPage(pageConverter->pluginData, buf, buf)) != NULL)
- break;
-#endif
+ while ((bytesRead = read(src_fd, buf, BLCKSZ)) == BLCKSZ)
+ {
+ if (plugin->convertPage)
+ if ((msg = plugin->convertPage(plugin->pluginData,
+ buf,
+ buf)) != NULL)
+ break;
if (write(dstfd, buf, BLCKSZ) != BLCKSZ)
{
msg = "could not write new page to destination";
@@ -100,10 +88,8 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
else
return NULL;
}
- }
}
-
/*
* linkAndUpdateFile()
*
@@ -115,10 +101,26 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
*/
const char *
linkAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst)
+ const char *src, const char *dst, const char *type_suffix)
{
- if (pageConverter != NULL)
- return "Cannot in-place update this cluster, page-by-page conversion is required";
+ /*
+ * Even in link mode, we actually perform conversion for additional files
+ * such as the visibility map and free space map.
+ */
+ if (strcmp(type_suffix, "_vm") == 0 ||
+ strcmp(type_suffix, "_fsm") == 0)
+ {
+ int i;
+ convertPlugin *convert_table = pageConverter->converterTable;
+
+ for (i = 0; i < LOAD_PLUGIN_NUM; i++)
+ {
+ if (strcmp((convert_table[i]).src_type_suffix, type_suffix) == 0)
+ return (convert_table[i]).convertFile((convert_table[i]).pluginData,
+ dst, src);
+ }
+ return NULL;
+ }
if (pg_link_file(src, dst) == -1)
return getErrorText();
@@ -204,7 +206,6 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
}
#endif
-
void
check_hard_link(void)
{
@@ -224,6 +225,23 @@ check_hard_link(void)
unlink(new_link_file);
}
+/*
+ * This function just copies the file. The given pluginData carries
+ * information about the force flag to pass to the copy function.
+ */
+const char *
+pg_copy_file(void *pluginData, const char *dstName, const char *srcName)
+{
+#ifndef WIN32
+ if (copy_file(srcName, dstName, true) == -1)
+#else
+ if (CopyFile(srcName, dstName, false) == 0)
+#endif
+ return getErrorText();
+ else
+ return NULL;
+}
+
#ifdef WIN32
static int
win32_pghardlink(const char *src, const char *dst)
diff --git a/src/bin/pg_upgrade/page.c b/src/bin/pg_upgrade/page.c
index e5686e5..8c53b43 100644
--- a/src/bin/pg_upgrade/page.c
+++ b/src/bin/pg_upgrade/page.c
@@ -13,36 +13,82 @@
#include "storage/bufpage.h"
-
-#ifdef PAGE_CONVERSION
+#include <dlfcn.h>
static void getPageVersion(
uint16 *version, const char *pathName);
-static pageCnvCtx *loadConverterPlugin(
- uint16 newPageVersion, uint16 oldPageVersion);
+static bool loadConverterPlugin(pageCnvCtx *converter, int pos,
+ const char *pluginName);
+static void setCopyPluginToPageConverter(convertPlugin *plugin);
+static void setupMainPageConverter(pageCnvCtx *converter, int pos);
+static void setupVMPageConverter(pageCnvCtx *converter, int pos);
+static void setupFSMPageConverter(pageCnvCtx *converter, int pos);
+/*
+ * Set up a converter plugin that just copies the file without converting it.
+ * This is used when page conversion is not necessary.
+ */
+static void
+setCopyPluginToPageConverter(convertPlugin *plugin)
+{
+ plugin->startup = NULL;
+ plugin->convertFile = (pluginConvertFile) pg_copy_file;
+ plugin->convertPage = NULL;
+ plugin->shutdown = NULL;
+ plugin->pluginData = NULL;
+}
/*
* setupPageConverter()
*
- * This function determines the PageLayoutVersion of the old cluster and
- * the PageLayoutVersion of the new cluster. If the versions differ, this
- * function loads a converter plugin and returns a pointer to a pageCnvCtx
- * object (in *result) that knows how to convert pages from the old format
- * to the new format. If the versions are identical, this function just
- * returns a NULL pageCnvCtx pointer to indicate that page-by-page conversion
- * is not required.
+ * This function sets up all page converters and returns a pointer to a
+ * pageCnvCtx. After loading all plugin functions, it invokes each plugin's
+ * startup function, if one is provided.
*/
-pageCnvCtx *
+const pageCnvCtx *
setupPageConverter(void)
{
+ pageCnvCtx *converter = (pageCnvCtx *) pg_malloc(sizeof(pageCnvCtx));
+ int i;
+
+ /* Load convert plugin for main relation file */
+ setupMainPageConverter(converter, 0);
+
+ /* Other additional converter plugins are loaded here */
+ setupVMPageConverter(converter, 1);
+ setupFSMPageConverter(converter, 2);
+
+ /* Invoke each plugin's startup function, if it has one */
+ for (i = 0; i < LOAD_PLUGIN_NUM; i++)
+ {
+ convertPlugin *plugin = &(converter->converterTable[i]);
+ if (plugin->startup)
+ plugin->startup(MIGRATOR_API_VERSION, &plugin->pluginVersion,
+ converter->newPageVersion, converter->oldPageVersion,
+ &plugin->pluginData);
+ }
+
+ return converter;
+}
+
+/*
+ * setupMainPageConverter()
+ *
+ * This function determines the PageLayoutVersion of the old cluster and
+ * the PageLayoutVersion of the new cluster. If the versions differ, this
+ * function loads a converter plugin that knows how to convert pages from
+ * the old format to the new format. If the versions are identical, this
+ * function just installs a plugin function that copies the file as-is.
+ */
+static void
+setupMainPageConverter(pageCnvCtx *converter, int pos)
+{
uint16 oldPageVersion;
uint16 newPageVersion;
- pageCnvCtx *converter;
- const char *msg;
char dstName[MAXPGPATH];
char srcName[MAXPGPATH];
+ convertPlugin *plugin = &(converter->converterTable[pos]);
snprintf(dstName, sizeof(dstName), "%s/global/%u", new_cluster.pgdata,
new_cluster.pg_database_oid);
@@ -52,27 +98,114 @@ setupPageConverter(void)
getPageVersion(&oldPageVersion, srcName);
getPageVersion(&newPageVersion, dstName);
- /*
- * If the old cluster and new cluster use the same page layouts, then we
- * don't need a page converter.
- */
+ converter->oldPageVersion = oldPageVersion;
+ converter->newPageVersion = newPageVersion;
+ plugin->src_type_suffix = "";
+ plugin->dst_type_suffix = "";
+
if (newPageVersion != oldPageVersion)
{
+ char pluginName[MAXPGPATH];
+
+ /*
+ * Try to find a plugin that can convert pages of oldPageVersion into
+ * pages of newPageVersion. For example, if oldPageVersion = 3 and
+ * newPageVersion is 4, we search for a plugin named:
+ * plugins/convertLayout_3_to_4.dll
+ */
+
+ /*
+ * FIXME: we are searching for plugins relative to the current directory,
+ * we should really search relative to our own executable instead.
+ */
+ snprintf(pluginName, sizeof(pluginName), "./plugins/convertLayout_%d_to_%d%s",
+ oldPageVersion, newPageVersion, DLSUFFIX);
+
/*
* The clusters use differing page layouts, see if we can find a
* plugin that knows how to convert from the old page layout to the
* new page layout.
*/
-
- if ((converter = loadConverterPlugin(newPageVersion, oldPageVersion)) == NULL)
+ if (!loadConverterPlugin(converter, pos, pluginName))
pg_fatal("could not find plugin to convert from old page layout to new page layout\n");
+ }
+ else
+ {
+ /*
+ * If we don't need to do any conversion then we install a plugin
+ * function that just copies the file.
+ */
+ setCopyPluginToPageConverter(plugin);
+ }
+}
+
+/*
+ * setupVMPageConverter()
+ *
+ * Set up convert plugin function for file having "_vm" type suffix.
+ */
+static void
+setupVMPageConverter(pageCnvCtx *converter, int pos)
+{
+ convertPlugin *plugin = &(converter->converterTable[pos]);
+
+ /* Set type suffix for visibility map */
+ plugin->src_type_suffix = "_vm";
+ plugin->dst_type_suffix = "_vm";
+
+ /*
+ * Do we need to add frozen bit into visibility map?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ {
+ char libpath[MAXPGPATH];
+ char pluginName[MAXPGPATH];
+ bool *checksum_enabled = pg_malloc(sizeof(bool));
+
+ get_lib_path(mypath, libpath);
+ snprintf(pluginName, sizeof(pluginName), "%s/plugins/convertLayoutVM_add_frozenbit%s",
+ libpath, DLSUFFIX);
- return converter;
+ if (!(loadConverterPlugin(converter, pos, pluginName)))
+ pg_fatal("could not find additional plugin to convert from old page layout to new page layout\n");
+
+ *checksum_enabled = false;
+
+ /* Check whether checksums are enabled on both clusters */
+ if (old_cluster.controldata.data_checksum_version != 0 &&
+ new_cluster.controldata.data_checksum_version != 0)
+ *checksum_enabled = true;
+
+ plugin->pluginData = (void *) checksum_enabled;
}
else
- return NULL;
+ /*
+ * If we don't need to do any conversion for the visibility map then
+ * we install a plugin function that just copies the file.
+ */
+ setCopyPluginToPageConverter(plugin);
}
+/*
+ * setupFSMPageConverter()
+ *
+ * Set up page converter plugin function for the file having "_fsm" type suffix.
+ */
+static void
+setupFSMPageConverter(pageCnvCtx *converter, int pos)
+{
+ convertPlugin *plugin = &(converter->converterTable[pos]);
+
+ /* Set type suffix for free space map */
+ plugin->src_type_suffix = "_fsm";
+ plugin->dst_type_suffix = "_fsm";
+
+ /*
+ * We don't need to do any conversion for the free space map for now.
+ */
+ setCopyPluginToPageConverter(plugin);
+}
/*
* getPageVersion()
@@ -103,62 +236,31 @@ getPageVersion(uint16 *version, const char *pathName)
return;
}
-
/*
* loadConverterPlugin()
*
* This function loads a page-converter plugin library and grabs a
* pointer to each of the (interesting) functions provided by that
- * plugin. The name of the plugin library is derived from the given
- * newPageVersion and oldPageVersion. If a plugin is found, this
- * function returns a pointer to a pageCnvCtx object (which will contain
- * a collection of plugin function pointers). If the required plugin
- * is not found, this function returns NULL.
+ * plugin. The name of the plugin library is given. If a plugin is
+ * loaded successfully, this function returns true.
*/
-static pageCnvCtx *
-loadConverterPlugin(uint16 newPageVersion, uint16 oldPageVersion)
+static bool
+loadConverterPlugin(pageCnvCtx *converter, int pos, const char *pluginName)
{
- char pluginName[MAXPGPATH];
void *plugin;
- /*
- * Try to find a plugin that can convert pages of oldPageVersion into
- * pages of newPageVersion. For example, if we oldPageVersion = 3 and
- * newPageVersion is 4, we search for a plugin named:
- * plugins/convertLayout_3_to_4.dll
- */
-
- /*
- * FIXME: we are searching for plugins relative to the current directory,
- * we should really search relative to our own executable instead.
- */
- snprintf(pluginName, sizeof(pluginName), "./plugins/convertLayout_%d_to_%d%s",
- oldPageVersion, newPageVersion, DLSUFFIX);
-
- if ((plugin = pg_dlopen(pluginName)) == NULL)
- return NULL;
+ if ((plugin = dlopen(pluginName, RTLD_NOW | RTLD_GLOBAL)) == NULL)
+ return false;
else
{
- pageCnvCtx *result = (pageCnvCtx *) pg_malloc(sizeof(*result));
-
- result->old.PageVersion = oldPageVersion;
- result->new.PageVersion = newPageVersion;
+ convertPlugin *convert_plugin = &(converter->converterTable[pos]);
- result->startup = (pluginStartup) pg_dlsym(plugin, "init");
- result->convertFile = (pluginConvertFile) pg_dlsym(plugin, "convertFile");
- result->convertPage = (pluginConvertPage) pg_dlsym(plugin, "convertPage");
- result->shutdown = (pluginShutdown) pg_dlsym(plugin, "fini");
- result->pluginData = NULL;
-
- /*
- * If the plugin has exported an initializer, go ahead and invoke it.
- */
- if (result->startup)
- result->startup(MIGRATOR_API_VERSION, &result->pluginVersion,
- newPageVersion, oldPageVersion, &result->pluginData);
-
- return result;
+ convert_plugin->startup = (pluginStartup) dlsym(plugin, "init");
+ convert_plugin->convertFile = (pluginConvertFile) dlsym(plugin, "convertFile");
+ convert_plugin->convertPage = (pluginConvertPage) dlsym(plugin, "convertPage");
+ convert_plugin->shutdown = (pluginShutdown) dlsym(plugin, "fini");
+ convert_plugin->pluginData = NULL;
}
-}
-#endif
+ return true;
+}
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 984c395..71c69db 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -54,6 +54,7 @@ static void cleanup(void);
ClusterInfo old_cluster,
new_cluster;
OSInfo os_info;
+char mypath[MAXPGPATH];
char *output_files[] = {
SERVER_LOG_FILE,
@@ -76,6 +77,9 @@ main(int argc, char **argv)
parseCommandLine(argc, argv);
+ if (find_my_exec(argv[0], mypath) != 0)
+ pg_fatal("could not find own program executable\n");
+
get_restricted_token(os_info.progname);
adjust_data_dir(&old_cluster);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index bc733c4..5f30623 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -110,6 +110,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map was changed by this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201602131
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -322,6 +326,7 @@ extern UserOpts user_opts;
extern ClusterInfo old_cluster,
new_cluster;
extern OSInfo os_info;
+extern char mypath[MAXPGPATH];
/* check.c */
@@ -360,11 +365,15 @@ bool exec_prog(const char *log_file, const char *opt_log_file,
bool throw_error, const char *fmt,...) pg_attribute_printf(4, 5);
void verify_directories(void);
bool pid_lock_file_exists(const char *datadir);
-
+const char *pg_copy_file(void *pluginData, const char *src, const char *dst);
/* file.c */
-#ifdef PAGE_CONVERSION
+/*
+ * We have three kinds of file suffix: "", "_vm", and "_fsm".
+ */
+#define LOAD_PLUGIN_NUM 3 /* converters for the main fork, VM, and FSM */
+
typedef const char *(*pluginStartup) (uint16 migratorVersion,
uint16 *pluginVersion, uint16 newPageVersion,
uint16 oldPageVersion, void **pluginData);
@@ -376,28 +385,32 @@ typedef const char *(*pluginShutdown) (void *pluginData);
typedef struct
{
+ uint16 pluginVersion; /* API version of converter plugin */
+ char *src_type_suffix;
+ char *dst_type_suffix;
+ pluginStartup startup; /* Pointer to plugin's startup function */
+ pluginConvertFile convertFile; /* Pointer to plugin's file converter
+ * function */
+ pluginConvertPage convertPage; /* Pointer to plugin's page converter
+ * function */
+ pluginShutdown shutdown; /* Pointer to plugin's shutdown function */
+ void *pluginData; /* Plugin data (set by plugin) */
+} convertPlugin;
+
+typedef struct
+{
uint16 oldPageVersion; /* Page layout version of the old cluster */
uint16 newPageVersion; /* Page layout version of the new cluster */
uint16 pluginVersion; /* API version of converter plugin */
- void *pluginData; /* Plugin data (set by plugin) */
- pluginStartup startup; /* Pointer to plugin's startup function */
- pluginConvertFile convertFile; /* Pointer to plugin's file converter
- * function */
- pluginConvertPage convertPage; /* Pointer to plugin's page converter
- * function */
- pluginShutdown shutdown; /* Pointer to plugin's shutdown function */
+ convertPlugin converterTable[LOAD_PLUGIN_NUM];
} pageCnvCtx;
const pageCnvCtx *setupPageConverter(void);
-#else
-/* dummy */
-typedef void *pageCnvCtx;
-#endif
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst, bool force);
+ const char *dst, const char *type_suffix);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst);
+ const char *dst, const char *type_suffix);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/plugins/Makefile b/src/bin/pg_upgrade/plugins/Makefile
new file mode 100644
index 0000000..fb3f941
--- /dev/null
+++ b/src/bin/pg_upgrade/plugins/Makefile
@@ -0,0 +1,32 @@
+# src/bin/pg_upgrade/plugins/Makefile
+
+PGFILEDESC = "page conversion plugins for pg_upgrade"
+
+subdir = src/bin/pg_upgrade/plugins
+top_builddir = ../../../../
+include $(top_builddir)/src/Makefile.global
+
+#PG_CPPFLAGS=-I$(top_builddir)/src/bin/pg_upgrade
+override CPPFLAGS := -DDLSUFFIX=\"$(DLSUFFIX)\" -I$(srcdir) -I../ -I$(libpq_srcdir) $(CPPFLAGS)
+
+NAME = convertLayoutVM_add_frozenbit
+OBJS = convertLayoutVM_add_frozenbit.o
+plugindir = $(DESTDIR)$(libdir)/plugins
+
+all: all-shared-lib
+
+include $(top_srcdir)/src/Makefile.shlib
+
+install: all installdirs install-plugins
+
+installdirs:
+ $(MKDIR_P) '$(plugindir)'
+
+install-plugins:
+ $(INSTALL_SHLIB) $(NAME).so '$(plugindir)'
+
+uninstall:
+ rm -f '$(plugindir)/$(NAME).so'
+
+clean:
+ rm -f $(OBJS) $(NAME).so
\ No newline at end of file
diff --git a/src/bin/pg_upgrade/plugins/convertLayoutVM_add_frozenbit.c b/src/bin/pg_upgrade/plugins/convertLayoutVM_add_frozenbit.c
new file mode 100644
index 0000000..f0ab08f
--- /dev/null
+++ b/src/bin/pg_upgrade/plugins/convertLayoutVM_add_frozenbit.c
@@ -0,0 +1,159 @@
+/*
+ * convertLayoutVM_add_frozenbit.c
+ *
+ * Page converter plugin for Visibility Map
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ * src/bin/pg_upgrade/plugins.c
+ */
+
+#include "postgres_fe.h"
+
+#include "access/visibilitymap.h"
+#include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
+#include "port.h"
+
+#include <fcntl.h>
+
+/* plugin function */
+const char* convertFile(void *pluginData,
+ const char *dstName, const char *srcName);
+
+static const int rewriteVisibilitymap(const char *fromfile, const char *tofile,
+ bool checksum_enabled);
+
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
+
+/*
+ * convertFile()
+ *
+ * This plugin function is loaded by the main procedure if required.
+ * pluginData indicates whether checksums are enabled on both clusters.
+ * If the rewriting function fails, an error message is returned.
+ */
+const char *
+convertFile(void *pluginData, const char *dstName, const char *srcName)
+{
+ bool checksum_enabled;
+
+ checksum_enabled = *(bool *)pluginData;
+
+ if (rewriteVisibilitymap(srcName, dstName, checksum_enabled) == -1)
+ {
+#ifdef WIN32
+ _dosmaperr(GetLastError());
+#endif
+ return strdup(strerror(errno));
+ }
+
+ return NULL;
+}
+
+/*
+ * rewriteVisibilitymap()
+ *
+ * Copies a visibility map file while adding an all-frozen bit (initially 0) for each heap block.
+ */
+static const int
+rewriteVisibilitymap(const char *fromfile, const char *tofile, bool checksum_enabled)
+{
+ int src_fd = 0;
+ int dst_fd = 0;
+ char buffer[BLCKSZ];
+ ssize_t bytesRead;
+ int rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+ BlockNumber blkno = 0;
+
+ /* Reset errno */
+ errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ goto err;
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ /* Perform data rewriting per page */
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
+ char *cur, *end, *blkend;
+ PageHeaderData pageheader;
+ uint16 vm_bits;
+
+ /* Save the page header data */
+ memcpy(&pageheader, buffer, SizeOfPageHeaderData);
+
+ cur = buffer + SizeOfPageHeaderData;
+ end = buffer + SizeOfPageHeaderData + rewriteVmBytesPerPage;
+ blkend = buffer + bytesRead;
+
+ while (blkend >= end)
+ {
+ char vmbuf[BLCKSZ];
+ char *vmtmp = vmbuf;
+
+ /* Copy page header in advance */
+ memcpy(vmbuf, &pageheader, SizeOfPageHeaderData);
+
+ vmtmp += SizeOfPageHeaderData;
+
+ /* Rewrite visibility map bit one by one */
+ while (end > cur)
+ {
+ /* Write the rewritten bits looked up from the table */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+ memcpy(vmtmp, &vm_bits, BITS_PER_HEAPBLOCK);
+
+ cur++;
+ vmtmp += BITS_PER_HEAPBLOCK;
+ }
+
+ /* Set new checksum for the visibility map page, if enabled */
+ if (checksum_enabled)
+ ((PageHeader) vmbuf)->pd_checksum = pg_checksum_page(vmbuf, blkno);
+
+ if (write(dst_fd, vmbuf, BLCKSZ) != BLCKSZ)
+ {
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ end += rewriteVmBytesPerPage;
+ blkno++;
+ }
+ }
+
+err:
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? 0 : -1;
+}
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c059c5b..571393b 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -18,7 +18,7 @@
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size, char *old_tablespace);
static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+ const char *suffix);
/*
@@ -82,6 +82,10 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
{
int old_dbnum,
new_dbnum;
+ pageCnvCtx *pageConverter = NULL;
+
+ /* Set up page-converter and load necessary plugin */
+ pageConverter = (pageCnvCtx *) setupPageConverter();
/* Scan the old cluster databases and transfer their files */
for (old_dbnum = new_dbnum = 0;
@@ -92,7 +96,6 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
*new_db = NULL;
FileNameMap *mappings;
int n_maps;
- pageCnvCtx *pageConverter = NULL;
/*
* Advance past any databases that exist in the new cluster but not in
@@ -115,10 +118,6 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
if (n_maps)
{
print_maps(mappings, n_maps, new_db->db_name);
-
-#ifdef PAGE_CONVERSION
- pageConverter = setupPageConverter();
-#endif
transfer_single_new_db(pageConverter, mappings, n_maps,
old_tablespace);
}
@@ -144,15 +143,9 @@ get_pg_database_relfilenode(ClusterInfo *cluster)
int i_relfile;
res = executeQueryOrDie(conn,
- "SELECT c.relname, c.relfilenode "
- "FROM pg_catalog.pg_class c, "
- " pg_catalog.pg_namespace n "
- "WHERE c.relnamespace = n.oid AND "
- " n.nspname = 'pg_catalog' AND "
- " c.relname = 'pg_database' "
- "ORDER BY c.relname");
-
- i_relfile = PQfnumber(res, "relfilenode");
+ "SELECT pg_relation_filenode('pg_database') AS filenode");
+
+ i_relfile = PQfnumber(res, "filenode");
cluster->pg_database_oid = atooid(PQgetvalue(res, 0, i_relfile));
PQclear(res);
@@ -268,15 +261,11 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
/* Copying files might take some time, so give feedback. */
pg_log(PG_STATUS, "%s", old_file);
- if ((user_opts.transfer_mode == TRANSFER_MODE_LINK) && (pageConverter != NULL))
- pg_fatal("This upgrade requires page-by-page conversion, "
- "you must use copy mode instead of link mode.\n");
-
if (user_opts.transfer_mode == TRANSFER_MODE_COPY)
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, type_suffix)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -284,7 +273,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
+ if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file, type_suffix)) != NULL)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index ba79fb3..cd9b17e 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
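
The 256-entry rewrite_vm_table in the plugin above is easier to review once
you see how it can be derived: each bit of an old one-bit-per-block visibility
map byte is spread to the even bit position of a two-bits-per-block word, and
every newly inserted all-frozen bit is left as zero. The throwaway generator
below reproduces the table under that assumption; it is not part of the patch.

#include <stdint.h>
#include <stdio.h>

/* Spread the 8 old per-block bits into the even positions of a 16-bit word. */
static uint16_t
spread_byte(uint8_t old)
{
    uint16_t out = 0;
    int bit;

    for (bit = 0; bit < 8; bit++)
        if (old & (1 << bit))
            out |= (uint16_t) (1 << (2 * bit)); /* odd (all-frozen) bits stay 0 */
    return out;
}

int main(void)
{
    int i;

    /* Prints the same 16 x 16 table as rewrite_vm_table, e.g. 0xFF -> 21845. */
    for (i = 0; i < 256; i++)
        printf("%u%s", spread_byte((uint8_t) i), (i % 16 == 15) ? ",\n" : ", ");
    return 0;
}
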
003_enhance_visibilitymap_debug_messages_v35.patchapplication/octet-stream; name=003_enhance_visibilitymap_debug_messages_v35.patchDownload
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 6186caf..f4d878b 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -175,7 +175,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s, block %d", RelationGetRelationName(rel), heapBlk);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -274,7 +274,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
uint8 *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s, block %d, flags %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -364,7 +364,7 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_get_status %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_get_status %s, block %d", RelationGetRelationName(rel), heapBlk);
#endif
/* Reuse the old pinned buffer if possible */
@@ -467,7 +467,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ elog(DEBUG1, "vm_truncate %s, block %d", RelationGetRelationName(rel), nheapblocks);
#endif
RelationOpenSmgr(rel);
On Sun, Feb 14, 2016 at 12:19 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thank you for reviewing this patch.
I've divided 000 patch into two patches, and attached latest 4 patches in total.
Thank you! I'll go through this again as soon as I have a free moment.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Feb 10, 2016 at 04:39:15PM +0900, Kyotaro HORIGUCHI wrote:
I still agree with this plugin approach, but I felt it's still
complicated a bit, and I'm concerned that patch size has been
increased.
Please give me feedbacks.
Yeah, I feel the same. What make it worse, the plugin mechanism
will get further complex if we make it more flexible for possible
usage as I proposed above. It is apparently too complicated for
deciding whether to load *just one*, for now, converter
function. And no additional converter is in sight.
I incline to pull out all the plugin stuff of pg_upgrade. We are
so prudent to make changes of file formats so this kind of events
will happen with several-years intervals. The plugin mechanism
would be valuable if we are encouraged to change file formats
more frequently and freely by providing it, but such situation
absolutely introduces more untoward things..
I agreed on ripping out the converter plugin ability of pg_upgrade.
Remember pg_upgrade was originally written by EnterpriseDB staff, and I
think they expected their closed-source fork of Postgres might need a
custom page converter someday, but it never needed one, and at this
point I think having the code in there is just making things more
complex. I see _no_ reason for community Postgres to use a plugin
converter because we are going to need that code for every upgrade from
pre-9.6 to 9.6+, so why not just hard-code in the functions we need. We
can remove it once 9.5 is end-of-life.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +
On Tue, Feb 16, 2016 at 6:13 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Feb 10, 2016 at 04:39:15PM +0900, Kyotaro HORIGUCHI wrote:
I still agree with this plugin approach, but I felt it's still
complicated a bit, and I'm concerned that patch size has been
increased.
Please give me feedbacks.
Yeah, I feel the same. What make it worse, the plugin mechanism
will get further complex if we make it more flexible for possible
usage as I proposed above. It is apparently too complicated for
deciding whether to load *just one*, for now, converter
function. And no additional converter is in sight.
I incline to pull out all the plugin stuff of pg_upgrade. We are
so prudent to make changes of file formats so this kind of events
will happen with several-years intervals. The plugin mechanism
would be valuable if we are encouraged to change file formats
more frequently and freely by providing it, but such situation
absolutely introduces more untoward things..
I agreed on ripping out the converter plugin ability of pg_upgrade.
Remember pg_upgrade was originally written by EnterpriseDB staff, and I
think they expected their closed-source fork of Postgres might need a
custom page converter someday, but it never needed one, and at this
point I think having the code in there is just making things more
complex. I see _no_ reason for community Postgres to use a plugin
converter because we are going to need that code for every upgrade from
pre-9.6 to 9.6+, so why not just hard-code in the functions we need. We
can remove it once 9.5 is end-of-life.
Hm, should we rather remove the source code around PAGE_CONVERSION and
page.c in 9.6?
Regards,
--
Masahiko Sawada
On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
I agreed on ripping out the converter plugin ability of pg_upgrade.
Remember pg_upgrade was originally written by EnterpriseDB staff, and I
think they expected their closed-source fork of Postgres might need a
custom page converter someday, but it never needed one, and at this
point I think having the code in there is just making things more
complex. I see _no_ reason for community Postgres to use a plugin
converter because we are going to need that code for every upgrade from
pre-9.6 to 9.6+, so why not just hard-code in the functions we need. We
can remove it once 9.5 is end-of-life.
Hm, we should rather remove the source code around PAGE_CONVERSION and
page.c at 9.6?
Yes. I can do it if you wish.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +
On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
I agreed on ripping out the converter plugin ability of pg_upgrade.
Remember pg_upgrade was originally written by EnterpriseDB staff, and I
think they expected their closed-source fork of Postgres might need a
custom page converter someday, but it never needed one, and at this
point I think having the code in there is just making things more
complex. I see _no_ reason for community Postgres to use a plugin
converter because we are going to need that code for every upgrade from
pre-9.6 to 9.6+, so why not just hard-code in the functions we need. We
can remove it once 9.5 is end-of-life.
Hm, we should rather remove the source code around PAGE_CONVERSION and
page.c at 9.6?
Yes. I can do it if you wish.
I see. I understand that the page-converter code could be useful in some
future cases, but it makes things more complex.
So I will post the patch without the page converter if there are no
objections from other hackers.
Regards,
--
Masahiko Sawada
Masahiko Sawada wrote:
On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
I agreed on ripping out the converter plugin ability of pg_upgrade.
Remember pg_upgrade was originally written by EnterpriseDB staff, and I
think they expected their closed-source fork of Postgres might need a
custom page converter someday, but it never needed one, and at this
point I think having the code in there is just making things more
complex. I see _no_ reason for community Postgres to use a plugin
converter because we are going to need that code for every upgrade from
pre-9.6 to 9.6+, so why not just hard-code in the functions we need. We
can remove it once 9.5 is end-of-life.
Hm, we should rather remove the source code around PAGE_CONVERSION and
page.c at 9.6?
Yes. I can do it if you wish.
I see. I understand that page-converter code would be useful for some
future cases, but makes thing more complex.
If we're not going to use it, let's get rid of it right away. There's
no point in having a feature that adds complexity just because we might
find some hypothetical use of it in a not-yet-imagined future.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Feb 16, 2016 at 03:57:01PM -0300, Alvaro Herrera wrote:
Masahiko Sawada wrote:
On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
I agreed on ripping out the converter plugin ability of pg_upgrade.
Remember pg_upgrade was originally written by EnterpriseDB staff, and I
think they expected their closed-source fork of Postgres might need a
custom page converter someday, but it never needed one, and at this
point I think having the code in there is just making things more
complex. I see _no_ reason for community Postgres to use a plugin
converter because we are going to need that code for every upgrade from
pre-9.6 to 9.6+, so why not just hard-code in the functions we need. We
can remove it once 9.5 is end-of-life.
Hm, we should rather remove the source code around PAGE_CONVERSION and
page.c at 9.6?
Yes. I can do it if you wish.
I see. I understand that page-converter code would be useful for some
future cases, but makes thing more complex.
If we're not going to use it, let's get rid of it right away. There's
no point in having a feature that adds complexity just because we might
find some hypothetical use of it in a not-yet-imagined future.
Agreed. We can always add it later if we need it.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +
On Wed, Feb 17, 2016 at 4:08 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Feb 16, 2016 at 03:57:01PM -0300, Alvaro Herrera wrote:
Masahiko Sawada wrote:
On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
I agreed on ripping out the converter plugin ability of pg_upgrade.
Remember pg_upgrade was originally written by EnterpriseDB staff, and I
think they expected their closed-source fork of Postgres might need a
custom page converter someday, but it never needed one, and at this
point I think having the code in there is just making things more
complex. I see _no_ reason for community Postgres to use a plugin
converter because we are going to need that code for every upgrade from
pre-9.6 to 9.6+, so why not just hard-code in the functions we need. We
can remove it once 9.5 is end-of-life.
Hm, we should rather remove the source code around PAGE_CONVERSION and
page.c at 9.6?
Yes. I can do it if you wish.
I see. I understand that page-converter code would be useful for some
future cases, but makes thing more complex.
If we're not going to use it, let's get rid of it right away. There's
no point in having a feature that adds complexity just because we might
find some hypothetical use of it in a not-yet-imagined future.
Agreed. We can always add it later if we need it.
Attached patch gets rid of page conversion code.
Regards,
--
Masahiko Sawada
Attachments:
Remove_page_conversion_from_pg_upgrade.patch (binary/octet-stream)
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 9357ad8..115d506 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -25,15 +25,11 @@ static int win32_pghardlink(const char *src, const char *dst);
/*
* copyAndUpdateFile()
*
- * Copies a relation file from src to dst. If pageConverter is non-NULL, this function
- * uses that pageConverter to do a page-by-page conversion.
+ * Copies a relation file from src to dst.
*/
const char *
-copyAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst, bool force)
+copyAndUpdateFile(const char *src, const char *dst, bool force)
{
- if (pageConverter == NULL)
- {
#ifndef WIN32
if (copy_file(src, dst, force) == -1)
#else
@@ -42,65 +38,6 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
return getErrorText();
else
return NULL;
- }
- else
- {
- /*
- * We have a pageConverter object - that implies that the
- * PageLayoutVersion differs between the two clusters so we have to
- * perform a page-by-page conversion.
- *
- * If the pageConverter can convert the entire file at once, invoke
- * that plugin function, otherwise, read each page in the relation
- * file and call the convertPage plugin function.
- */
-
-#ifdef PAGE_CONVERSION
- if (pageConverter->convertFile)
- return pageConverter->convertFile(pageConverter->pluginData,
- dst, src);
- else
-#endif
- {
- int src_fd;
- int dstfd;
- char buf[BLCKSZ];
- ssize_t bytesRead;
- const char *msg = NULL;
-
- if ((src_fd = open(src, O_RDONLY, 0)) < 0)
- return "could not open source file";
-
- if ((dstfd = open(dst, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR)) < 0)
- {
- close(src_fd);
- return "could not create destination file";
- }
-
- while ((bytesRead = read(src_fd, buf, BLCKSZ)) == BLCKSZ)
- {
-#ifdef PAGE_CONVERSION
- if ((msg = pageConverter->convertPage(pageConverter->pluginData, buf, buf)) != NULL)
- break;
-#endif
- if (write(dstfd, buf, BLCKSZ) != BLCKSZ)
- {
- msg = "could not write new page to destination";
- break;
- }
- }
-
- close(src_fd);
- close(dstfd);
-
- if (msg)
- return msg;
- else if (bytesRead != 0)
- return "found partial page in source file";
- else
- return NULL;
- }
- }
}
@@ -114,12 +51,8 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
* instead of copying the data from the old cluster to the new cluster.
*/
const char *
-linkAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst)
+linkAndUpdateFile(const char *src, const char *dst)
{
- if (pageConverter != NULL)
- return "Cannot in-place update this cluster, page-by-page conversion is required";
-
if (pg_link_file(src, dst) == -1)
return getErrorText();
else
diff --git a/src/bin/pg_upgrade/page.c b/src/bin/pg_upgrade/page.c
deleted file mode 100644
index e5686e5..0000000
--- a/src/bin/pg_upgrade/page.c
+++ /dev/null
@@ -1,164 +0,0 @@
-/*
- * page.c
- *
- * per-page conversion operations
- *
- * Copyright (c) 2010-2016, PostgreSQL Global Development Group
- * src/bin/pg_upgrade/page.c
- */
-
-#include "postgres_fe.h"
-
-#include "pg_upgrade.h"
-
-#include "storage/bufpage.h"
-
-
-#ifdef PAGE_CONVERSION
-
-
-static void getPageVersion(
- uint16 *version, const char *pathName);
-static pageCnvCtx *loadConverterPlugin(
- uint16 newPageVersion, uint16 oldPageVersion);
-
-
-/*
- * setupPageConverter()
- *
- * This function determines the PageLayoutVersion of the old cluster and
- * the PageLayoutVersion of the new cluster. If the versions differ, this
- * function loads a converter plugin and returns a pointer to a pageCnvCtx
- * object (in *result) that knows how to convert pages from the old format
- * to the new format. If the versions are identical, this function just
- * returns a NULL pageCnvCtx pointer to indicate that page-by-page conversion
- * is not required.
- */
-pageCnvCtx *
-setupPageConverter(void)
-{
- uint16 oldPageVersion;
- uint16 newPageVersion;
- pageCnvCtx *converter;
- const char *msg;
- char dstName[MAXPGPATH];
- char srcName[MAXPGPATH];
-
- snprintf(dstName, sizeof(dstName), "%s/global/%u", new_cluster.pgdata,
- new_cluster.pg_database_oid);
- snprintf(srcName, sizeof(srcName), "%s/global/%u", old_cluster.pgdata,
- old_cluster.pg_database_oid);
-
- getPageVersion(&oldPageVersion, srcName);
- getPageVersion(&newPageVersion, dstName);
-
- /*
- * If the old cluster and new cluster use the same page layouts, then we
- * don't need a page converter.
- */
- if (newPageVersion != oldPageVersion)
- {
- /*
- * The clusters use differing page layouts, see if we can find a
- * plugin that knows how to convert from the old page layout to the
- * new page layout.
- */
-
- if ((converter = loadConverterPlugin(newPageVersion, oldPageVersion)) == NULL)
- pg_fatal("could not find plugin to convert from old page layout to new page layout\n");
-
- return converter;
- }
- else
- return NULL;
-}
-
-
-/*
- * getPageVersion()
- *
- * Retrieves the PageLayoutVersion for the given relation.
- *
- * Returns NULL on success (and stores the PageLayoutVersion at *version),
- * if an error occurs, this function returns an error message (in the form
- * of a null-terminated string).
- */
-static void
-getPageVersion(uint16 *version, const char *pathName)
-{
- int relfd;
- PageHeaderData page;
- ssize_t bytesRead;
-
- if ((relfd = open(pathName, O_RDONLY, 0)) < 0)
- pg_fatal("could not open relation %s\n", pathName);
-
- if ((bytesRead = read(relfd, &page, sizeof(page))) != sizeof(page))
- pg_fatal("could not read page header of %s\n", pathName);
-
- *version = PageGetPageLayoutVersion(&page);
-
- close(relfd);
-
- return;
-}
-
-
-/*
- * loadConverterPlugin()
- *
- * This function loads a page-converter plugin library and grabs a
- * pointer to each of the (interesting) functions provided by that
- * plugin. The name of the plugin library is derived from the given
- * newPageVersion and oldPageVersion. If a plugin is found, this
- * function returns a pointer to a pageCnvCtx object (which will contain
- * a collection of plugin function pointers). If the required plugin
- * is not found, this function returns NULL.
- */
-static pageCnvCtx *
-loadConverterPlugin(uint16 newPageVersion, uint16 oldPageVersion)
-{
- char pluginName[MAXPGPATH];
- void *plugin;
-
- /*
- * Try to find a plugin that can convert pages of oldPageVersion into
- * pages of newPageVersion. For example, if we oldPageVersion = 3 and
- * newPageVersion is 4, we search for a plugin named:
- * plugins/convertLayout_3_to_4.dll
- */
-
- /*
- * FIXME: we are searching for plugins relative to the current directory,
- * we should really search relative to our own executable instead.
- */
- snprintf(pluginName, sizeof(pluginName), "./plugins/convertLayout_%d_to_%d%s",
- oldPageVersion, newPageVersion, DLSUFFIX);
-
- if ((plugin = pg_dlopen(pluginName)) == NULL)
- return NULL;
- else
- {
- pageCnvCtx *result = (pageCnvCtx *) pg_malloc(sizeof(*result));
-
- result->old.PageVersion = oldPageVersion;
- result->new.PageVersion = newPageVersion;
-
- result->startup = (pluginStartup) pg_dlsym(plugin, "init");
- result->convertFile = (pluginConvertFile) pg_dlsym(plugin, "convertFile");
- result->convertPage = (pluginConvertPage) pg_dlsym(plugin, "convertPage");
- result->shutdown = (pluginShutdown) pg_dlsym(plugin, "fini");
- result->pluginData = NULL;
-
- /*
- * If the plugin has exported an initializer, go ahead and invoke it.
- */
- if (result->startup)
- result->startup(MIGRATOR_API_VERSION, &result->pluginVersion,
- newPageVersion, oldPageVersion, &result->pluginData);
-
- return result;
- }
-}
-
-#endif
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index bc733c4..327c1e9 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -364,40 +364,8 @@ bool pid_lock_file_exists(const char *datadir);
/* file.c */
-#ifdef PAGE_CONVERSION
-typedef const char *(*pluginStartup) (uint16 migratorVersion,
- uint16 *pluginVersion, uint16 newPageVersion,
- uint16 oldPageVersion, void **pluginData);
-typedef const char *(*pluginConvertFile) (void *pluginData,
- const char *dstName, const char *srcName);
-typedef const char *(*pluginConvertPage) (void *pluginData,
- const char *dstPage, const char *srcPage);
-typedef const char *(*pluginShutdown) (void *pluginData);
-
-typedef struct
-{
- uint16 oldPageVersion; /* Page layout version of the old cluster */
- uint16 newPageVersion; /* Page layout version of the new cluster */
- uint16 pluginVersion; /* API version of converter plugin */
- void *pluginData; /* Plugin data (set by plugin) */
- pluginStartup startup; /* Pointer to plugin's startup function */
- pluginConvertFile convertFile; /* Pointer to plugin's file converter
- * function */
- pluginConvertPage convertPage; /* Pointer to plugin's page converter
- * function */
- pluginShutdown shutdown; /* Pointer to plugin's shutdown function */
-} pageCnvCtx;
-
-const pageCnvCtx *setupPageConverter(void);
-#else
-/* dummy */
-typedef void *pageCnvCtx;
-#endif
-
-const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst, bool force);
-const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst);
+const char *copyAndUpdateFile(const char *src, const char *dst, bool force);
+const char *linkAndUpdateFile(const char *src, const char *dst);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c059c5b..3c82342 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -15,10 +15,8 @@
#include "access/transam.h"
-static void transfer_single_new_db(pageCnvCtx *pageConverter,
- FileNameMap *maps, int size, char *old_tablespace);
-static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
+static void transfer_relfile(FileNameMap *map, const char *suffix);
/*
@@ -92,7 +90,6 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
*new_db = NULL;
FileNameMap *mappings;
int n_maps;
- pageCnvCtx *pageConverter = NULL;
/*
* Advance past any databases that exist in the new cluster but not in
@@ -116,11 +113,7 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
{
print_maps(mappings, n_maps, new_db->db_name);
-#ifdef PAGE_CONVERSION
- pageConverter = setupPageConverter();
-#endif
- transfer_single_new_db(pageConverter, mappings, n_maps,
- old_tablespace);
+ transfer_single_new_db(mappings, n_maps, old_tablespace);
}
/* We allocate something even for n_maps == 0 */
pg_free(mappings);
@@ -166,8 +159,7 @@ get_pg_database_relfilenode(ClusterInfo *cluster)
* create links for mappings stored in "maps" array.
*/
static void
-transfer_single_new_db(pageCnvCtx *pageConverter,
- FileNameMap *maps, int size, char *old_tablespace)
+transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
{
int mapnum;
bool vm_crashsafe_match = true;
@@ -186,7 +178,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(&maps[mapnum], "");
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +186,9 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(&maps[mapnum], "_fsm");
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ transfer_relfile(&maps[mapnum], "_vm");
}
}
}
@@ -209,8 +201,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
* Copy or link file from old cluster to new one.
*/
static void
-transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+transfer_relfile(FileNameMap *map, const char *type_suffix)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -268,15 +259,11 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
/* Copying files might take some time, so give feedback. */
pg_log(PG_STATUS, "%s", old_file);
- if ((user_opts.transfer_mode == TRANSFER_MODE_LINK) && (pageConverter != NULL))
- pg_fatal("This upgrade requires page-by-page conversion, "
- "you must use copy mode instead of link mode.\n");
-
if (user_opts.transfer_mode == TRANSFER_MODE_COPY)
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ if ((msg = copyAndUpdateFile(old_file, new_file, true)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -284,7 +271,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
+ if ((msg = linkAndUpdateFile(old_file, new_file)) != NULL)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
On Wed, Feb 17, 2016 at 4:29 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Feb 17, 2016 at 4:08 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Feb 16, 2016 at 03:57:01PM -0300, Alvaro Herrera wrote:
Masahiko Sawada wrote:
On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
I agreed on ripping out the converter plugin ability of pg_upgrade.
Remember pg_upgrade was originally written by EnterpriseDB staff, and I
think they expected their closed-source fork of Postgres might need a
custom page converter someday, but it never needed one, and at this
point I think having the code in there is just making things more
complex. I see _no_ reason for community Postgres to use a plugin
converter because we are going to need that code for every upgrade from
pre-9.6 to 9.6+, so why not just hard-code in the functions we need. We
can remove it once 9.5 is end-of-life.
Hm, we should rather remove the source code around PAGE_CONVERSION and
page.c at 9.6?
Yes. I can do it if you wish.
I see. I understand that page-converter code would be useful for some
future cases, but makes thing more complex.
If we're not going to use it, let's get rid of it right away. There's
no point in having a feature that adds complexity just because we might
find some hypothetical use of it in a not-yet-imagined future.
Agreed. We can always add it later if we need it.
Attached patch gets rid of page conversion code.
Sorry, the previous patch was incorrect.
A fixed version of the patch is attached.
Regards,
--
Masahiko Sawada
Attachments:
Remove_page_conversion_from_pg_upgrade_v2.patch (binary/octet-stream)
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index d9c8145..0c882d9 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -8,7 +8,7 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
OBJS = check.o controldata.o dump.o exec.o file.o function.o info.o \
- option.o page.o parallel.o pg_upgrade.o relfilenode.o server.o \
+ option.o parallel.o pg_upgrade.o relfilenode.o server.o \
tablespace.o util.o version.o $(WIN32RES)
override CPPFLAGS := -DDLSUFFIX=\"$(DLSUFFIX)\" -I$(srcdir) -I$(libpq_srcdir) $(CPPFLAGS)
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 9357ad8..115d506 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -25,15 +25,11 @@ static int win32_pghardlink(const char *src, const char *dst);
/*
* copyAndUpdateFile()
*
- * Copies a relation file from src to dst. If pageConverter is non-NULL, this function
- * uses that pageConverter to do a page-by-page conversion.
+ * Copies a relation file from src to dst.
*/
const char *
-copyAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst, bool force)
+copyAndUpdateFile(const char *src, const char *dst, bool force)
{
- if (pageConverter == NULL)
- {
#ifndef WIN32
if (copy_file(src, dst, force) == -1)
#else
@@ -42,65 +38,6 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
return getErrorText();
else
return NULL;
- }
- else
- {
- /*
- * We have a pageConverter object - that implies that the
- * PageLayoutVersion differs between the two clusters so we have to
- * perform a page-by-page conversion.
- *
- * If the pageConverter can convert the entire file at once, invoke
- * that plugin function, otherwise, read each page in the relation
- * file and call the convertPage plugin function.
- */
-
-#ifdef PAGE_CONVERSION
- if (pageConverter->convertFile)
- return pageConverter->convertFile(pageConverter->pluginData,
- dst, src);
- else
-#endif
- {
- int src_fd;
- int dstfd;
- char buf[BLCKSZ];
- ssize_t bytesRead;
- const char *msg = NULL;
-
- if ((src_fd = open(src, O_RDONLY, 0)) < 0)
- return "could not open source file";
-
- if ((dstfd = open(dst, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR)) < 0)
- {
- close(src_fd);
- return "could not create destination file";
- }
-
- while ((bytesRead = read(src_fd, buf, BLCKSZ)) == BLCKSZ)
- {
-#ifdef PAGE_CONVERSION
- if ((msg = pageConverter->convertPage(pageConverter->pluginData, buf, buf)) != NULL)
- break;
-#endif
- if (write(dstfd, buf, BLCKSZ) != BLCKSZ)
- {
- msg = "could not write new page to destination";
- break;
- }
- }
-
- close(src_fd);
- close(dstfd);
-
- if (msg)
- return msg;
- else if (bytesRead != 0)
- return "found partial page in source file";
- else
- return NULL;
- }
- }
}
@@ -114,12 +51,8 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
* instead of copying the data from the old cluster to the new cluster.
*/
const char *
-linkAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst)
+linkAndUpdateFile(const char *src, const char *dst)
{
- if (pageConverter != NULL)
- return "Cannot in-place update this cluster, page-by-page conversion is required";
-
if (pg_link_file(src, dst) == -1)
return getErrorText();
else
diff --git a/src/bin/pg_upgrade/page.c b/src/bin/pg_upgrade/page.c
deleted file mode 100644
index e5686e5..0000000
--- a/src/bin/pg_upgrade/page.c
+++ /dev/null
@@ -1,164 +0,0 @@
-/*
- * page.c
- *
- * per-page conversion operations
- *
- * Copyright (c) 2010-2016, PostgreSQL Global Development Group
- * src/bin/pg_upgrade/page.c
- */
-
-#include "postgres_fe.h"
-
-#include "pg_upgrade.h"
-
-#include "storage/bufpage.h"
-
-
-#ifdef PAGE_CONVERSION
-
-
-static void getPageVersion(
- uint16 *version, const char *pathName);
-static pageCnvCtx *loadConverterPlugin(
- uint16 newPageVersion, uint16 oldPageVersion);
-
-
-/*
- * setupPageConverter()
- *
- * This function determines the PageLayoutVersion of the old cluster and
- * the PageLayoutVersion of the new cluster. If the versions differ, this
- * function loads a converter plugin and returns a pointer to a pageCnvCtx
- * object (in *result) that knows how to convert pages from the old format
- * to the new format. If the versions are identical, this function just
- * returns a NULL pageCnvCtx pointer to indicate that page-by-page conversion
- * is not required.
- */
-pageCnvCtx *
-setupPageConverter(void)
-{
- uint16 oldPageVersion;
- uint16 newPageVersion;
- pageCnvCtx *converter;
- const char *msg;
- char dstName[MAXPGPATH];
- char srcName[MAXPGPATH];
-
- snprintf(dstName, sizeof(dstName), "%s/global/%u", new_cluster.pgdata,
- new_cluster.pg_database_oid);
- snprintf(srcName, sizeof(srcName), "%s/global/%u", old_cluster.pgdata,
- old_cluster.pg_database_oid);
-
- getPageVersion(&oldPageVersion, srcName);
- getPageVersion(&newPageVersion, dstName);
-
- /*
- * If the old cluster and new cluster use the same page layouts, then we
- * don't need a page converter.
- */
- if (newPageVersion != oldPageVersion)
- {
- /*
- * The clusters use differing page layouts, see if we can find a
- * plugin that knows how to convert from the old page layout to the
- * new page layout.
- */
-
- if ((converter = loadConverterPlugin(newPageVersion, oldPageVersion)) == NULL)
- pg_fatal("could not find plugin to convert from old page layout to new page layout\n");
-
- return converter;
- }
- else
- return NULL;
-}
-
-
-/*
- * getPageVersion()
- *
- * Retrieves the PageLayoutVersion for the given relation.
- *
- * Returns NULL on success (and stores the PageLayoutVersion at *version),
- * if an error occurs, this function returns an error message (in the form
- * of a null-terminated string).
- */
-static void
-getPageVersion(uint16 *version, const char *pathName)
-{
- int relfd;
- PageHeaderData page;
- ssize_t bytesRead;
-
- if ((relfd = open(pathName, O_RDONLY, 0)) < 0)
- pg_fatal("could not open relation %s\n", pathName);
-
- if ((bytesRead = read(relfd, &page, sizeof(page))) != sizeof(page))
- pg_fatal("could not read page header of %s\n", pathName);
-
- *version = PageGetPageLayoutVersion(&page);
-
- close(relfd);
-
- return;
-}
-
-
-/*
- * loadConverterPlugin()
- *
- * This function loads a page-converter plugin library and grabs a
- * pointer to each of the (interesting) functions provided by that
- * plugin. The name of the plugin library is derived from the given
- * newPageVersion and oldPageVersion. If a plugin is found, this
- * function returns a pointer to a pageCnvCtx object (which will contain
- * a collection of plugin function pointers). If the required plugin
- * is not found, this function returns NULL.
- */
-static pageCnvCtx *
-loadConverterPlugin(uint16 newPageVersion, uint16 oldPageVersion)
-{
- char pluginName[MAXPGPATH];
- void *plugin;
-
- /*
- * Try to find a plugin that can convert pages of oldPageVersion into
- * pages of newPageVersion. For example, if we oldPageVersion = 3 and
- * newPageVersion is 4, we search for a plugin named:
- * plugins/convertLayout_3_to_4.dll
- */
-
- /*
- * FIXME: we are searching for plugins relative to the current directory,
- * we should really search relative to our own executable instead.
- */
- snprintf(pluginName, sizeof(pluginName), "./plugins/convertLayout_%d_to_%d%s",
- oldPageVersion, newPageVersion, DLSUFFIX);
-
- if ((plugin = pg_dlopen(pluginName)) == NULL)
- return NULL;
- else
- {
- pageCnvCtx *result = (pageCnvCtx *) pg_malloc(sizeof(*result));
-
- result->old.PageVersion = oldPageVersion;
- result->new.PageVersion = newPageVersion;
-
- result->startup = (pluginStartup) pg_dlsym(plugin, "init");
- result->convertFile = (pluginConvertFile) pg_dlsym(plugin, "convertFile");
- result->convertPage = (pluginConvertPage) pg_dlsym(plugin, "convertPage");
- result->shutdown = (pluginShutdown) pg_dlsym(plugin, "fini");
- result->pluginData = NULL;
-
- /*
- * If the plugin has exported an initializer, go ahead and invoke it.
- */
- if (result->startup)
- result->startup(MIGRATOR_API_VERSION, &result->pluginVersion,
- newPageVersion, oldPageVersion, &result->pluginData);
-
- return result;
- }
-}
-
-#endif
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index bc733c4..327c1e9 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -364,40 +364,8 @@ bool pid_lock_file_exists(const char *datadir);
/* file.c */
-#ifdef PAGE_CONVERSION
-typedef const char *(*pluginStartup) (uint16 migratorVersion,
- uint16 *pluginVersion, uint16 newPageVersion,
- uint16 oldPageVersion, void **pluginData);
-typedef const char *(*pluginConvertFile) (void *pluginData,
- const char *dstName, const char *srcName);
-typedef const char *(*pluginConvertPage) (void *pluginData,
- const char *dstPage, const char *srcPage);
-typedef const char *(*pluginShutdown) (void *pluginData);
-
-typedef struct
-{
- uint16 oldPageVersion; /* Page layout version of the old cluster */
- uint16 newPageVersion; /* Page layout version of the new cluster */
- uint16 pluginVersion; /* API version of converter plugin */
- void *pluginData; /* Plugin data (set by plugin) */
- pluginStartup startup; /* Pointer to plugin's startup function */
- pluginConvertFile convertFile; /* Pointer to plugin's file converter
- * function */
- pluginConvertPage convertPage; /* Pointer to plugin's page converter
- * function */
- pluginShutdown shutdown; /* Pointer to plugin's shutdown function */
-} pageCnvCtx;
-
-const pageCnvCtx *setupPageConverter(void);
-#else
-/* dummy */
-typedef void *pageCnvCtx;
-#endif
-
-const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst, bool force);
-const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst);
+const char *copyAndUpdateFile(const char *src, const char *dst, bool force);
+const char *linkAndUpdateFile(const char *src, const char *dst);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c059c5b..3c82342 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -15,10 +15,8 @@
#include "access/transam.h"
-static void transfer_single_new_db(pageCnvCtx *pageConverter,
- FileNameMap *maps, int size, char *old_tablespace);
-static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
+static void transfer_relfile(FileNameMap *map, const char *suffix);
/*
@@ -92,7 +90,6 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
*new_db = NULL;
FileNameMap *mappings;
int n_maps;
- pageCnvCtx *pageConverter = NULL;
/*
* Advance past any databases that exist in the new cluster but not in
@@ -116,11 +113,7 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
{
print_maps(mappings, n_maps, new_db->db_name);
-#ifdef PAGE_CONVERSION
- pageConverter = setupPageConverter();
-#endif
- transfer_single_new_db(pageConverter, mappings, n_maps,
- old_tablespace);
+ transfer_single_new_db(mappings, n_maps, old_tablespace);
}
/* We allocate something even for n_maps == 0 */
pg_free(mappings);
@@ -166,8 +159,7 @@ get_pg_database_relfilenode(ClusterInfo *cluster)
* create links for mappings stored in "maps" array.
*/
static void
-transfer_single_new_db(pageCnvCtx *pageConverter,
- FileNameMap *maps, int size, char *old_tablespace)
+transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
{
int mapnum;
bool vm_crashsafe_match = true;
@@ -186,7 +178,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(&maps[mapnum], "");
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +186,9 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(&maps[mapnum], "_fsm");
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ transfer_relfile(&maps[mapnum], "_vm");
}
}
}
@@ -209,8 +201,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
* Copy or link file from old cluster to new one.
*/
static void
-transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+transfer_relfile(FileNameMap *map, const char *type_suffix)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -268,15 +259,11 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
/* Copying files might take some time, so give feedback. */
pg_log(PG_STATUS, "%s", old_file);
- if ((user_opts.transfer_mode == TRANSFER_MODE_LINK) && (pageConverter != NULL))
- pg_fatal("This upgrade requires page-by-page conversion, "
- "you must use copy mode instead of link mode.\n");
-
if (user_opts.transfer_mode == TRANSFER_MODE_COPY)
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ if ((msg = copyAndUpdateFile(old_file, new_file, true)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -284,7 +271,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
+ if ((msg = linkAndUpdateFile(old_file, new_file)) != NULL)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
On Wed, Feb 17, 2016 at 4:44 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Feb 17, 2016 at 4:29 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Feb 17, 2016 at 4:08 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Feb 16, 2016 at 03:57:01PM -0300, Alvaro Herrera wrote:
Masahiko Sawada wrote:
On Wed, Feb 17, 2016 at 12:02 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Feb 16, 2016 at 11:56:25PM +0900, Masahiko Sawada wrote:
I agreed on ripping out the converter plugin ability of pg_upgrade.
Remember pg_upgrade was originally written by EnterpriseDB staff, and I
think they expected their closed-source fork of Postgres might need a
custom page converter someday, but it never needed one, and at this
point I think having the code in there is just making things more
complex. I see _no_ reason for community Postgres to use a plugin
converter because we are going to need that code for every upgrade from
pre-9.6 to 9.6+, so why not just hard-code in the functions we need. We
can remove it once 9.5 is end-of-life.
Hm, we should rather remove the source code around PAGE_CONVERSION and
page.c at 9.6?
Yes. I can do it if you wish.
I see. I understand that page-converter code would be useful for some
future cases, but makes thing more complex.
If we're not going to use it, let's get rid of it right away. There's
no point in having a feature that adds complexity just because we might
find some hypothetical use of it in a not-yet-imagined future.
Agreed. We can always add it later if we need it.
Attached patch gets rid of page conversion code.
Attached are the updated 5 patches.
I would like to explain these patches briefly again here to make
reviewing easier.
These patches can be divided into 2 purposes.
1. Freeze map
The 000_ patch adds an additional frozen bit to the visibility map, but doesn't
include the logic for improving freezing performance.
The 001_ patch gets rid of the page-conversion code from pg_upgrade. (This
patch isn't essentially related to this feature, but is required by
the 002_ patch.)
The 002_ patch adds an upgrading mechanism from pre-9.6 to 9.6+ and its regression test.
2. Improve freezing logic
The 003_ patch changes VACUUM to optimize scans based on the freeze map
(i.e., the 000_ patch), and adds its regression test.
The 004_ patch enhances the debug messages in src/backend/access/heap/visibilitymap.c.
Please review them.
Regards,
--
Masahiko Sawada
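
As a concrete picture of the freeze map the 000_ patch introduces: each heap page now gets two adjacent bits in the visibility map (all-visible and all-frozen), so one map byte covers four heap pages instead of eight. The following minimal sketch is illustrative only; it mirrors the HEAPBLK_TO_* arithmetic from the patch but is not the patch's code, MAPSIZE is assumed to be 8168 bytes (the default 8 kB block minus the page header), and the all-visible bit is assumed to be the low bit of each pair.

#include <stdint.h>
#include <stdio.h>

/* Assumed constants, mirroring the 000_ patch's new two-bit layout. */
#define BITS_PER_HEAPBLOCK   2                              /* all-visible + all-frozen */
#define HEAPBLOCKS_PER_BYTE  (8 / BITS_PER_HEAPBLOCK)
#define MAPSIZE              8168                           /* assumed: BLCKSZ minus page header */
#define HEAPBLOCKS_PER_PAGE  (MAPSIZE * HEAPBLOCKS_PER_BYTE)

int
main(void)
{
	uint32_t heapBlk = 123456;

	/* Same arithmetic as HEAPBLK_TO_MAPBLOCK / _MAPBYTE / _MAPBIT in the patch */
	uint32_t mapBlock = heapBlk / HEAPBLOCKS_PER_PAGE;
	uint32_t mapByte = (heapBlk % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE;
	uint32_t mapBit = (heapBlk % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK;

	printf("heap block %u -> VM page %u, byte %u, bit %u (all-visible), bit %u (all-frozen)\n",
	       heapBlk, mapBlock, mapByte, mapBit, mapBit + 1);
	return 0;
}

This two-bits-per-page layout is also why the patch replaces the single number_of_ones[] table with separate number_of_ones_for_visible[] and number_of_ones_for_frozen[] counting tables.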
Attachments:
000_add_frozen_bit_into_visibilitymap_v36.patch (binary/octet-stream)
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 001988b..5d08c73 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
* If the page has only visible tuples, then we can find out the free
* space from the FSM and move on.
*/
- if (visibilitymap_test(rel, blkno, &vmbuffer))
+ if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
{
freespace = GetRecordedFreeSpace(rel, blkno);
stat->tuple_len += BLCKSZ - freespace;
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 164d08c..ed429d8 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -623,18 +623,20 @@ can be used to examine the information stored in free space maps.
<para>
Each heap relation has a Visibility Map
(VM) to keep track of which pages contain only tuples that are known to be
-visible to all active transactions. It's stored
-alongside the main relation data in a separate relation fork, named after the
-filenode number of the relation, plus a <literal>_vm</> suffix. For example,
-if the filenode of a relation is 12345, the VM is stored in a file called
-<filename>12345_vm</>, in the same directory as the main relation file.
+visible to all active transactions, and which pages contain only frozen tuples.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_vm</> suffix.
+For example, if the filenode of a relation is 12345, the VM is stored in a file
+called <filename>12345_vm</>, in the same directory as the main relation file.
Note that indexes do not have VMs.
</para>
<para>
-The visibility map simply stores one bit per heap page. A set bit means
-that all tuples on the page are known to be visible to all transactions.
-This means that the page does not contain any tuples that need to be vacuumed.
+The visibility map stores two bits per heap page: all-visible, all-frozen.
+A set all-visible bit means that all tuples on the page are known to be visible
+to all transactions. A set all-frozen bit means that all tuples on the page are
+completely marked as frozen. This means that the page does not contain any tuples
+that need to be vacuumed and frozen.
This information can also be used by <firstterm>index-only scans</> to answer
queries using only the index tuple.
</para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f443742..5835e54 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7205,7 +7205,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
*/
XLogRecPtr
log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid)
+ TransactionId cutoff_xid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -7215,6 +7215,7 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(vm_buffer));
xlrec.cutoff_xid = cutoff_xid;
+ xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -7804,7 +7805,12 @@ heap_xlog_visible(XLogReaderState *record)
* the subsequent update won't be replayed to clear the flag.
*/
page = BufferGetPage(buffer);
- PageSetAllVisible(page);
+
+ if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
+ PageSetAllVisible(page);
+ if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
+ PageSetAllFrozen(page);
+
MarkBufferDirty(buffer);
}
else if (action == BLK_RESTORED)
@@ -7856,7 +7862,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid);
+ xlrec->cutoff_xid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index fc28f3f..217c694 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -15,39 +15,45 @@
* visibilitymap_pin - pin a map page for setting a bit
* visibilitymap_pin_ok - check whether correct map page is already pinned
* visibilitymap_set - set a bit in a previously pinned page
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
* visibilitymap_count - count number of bits set in visibility map
* visibilitymap_truncate - truncate the visibility map
*
* NOTES
*
- * The visibility map is a bitmap with one bit per heap page. A set bit means
- * that all tuples on the page are known visible to all transactions, and
- * therefore the page doesn't need to be vacuumed. The map is conservative in
- * the sense that we make sure that whenever a bit is set, we know the
- * condition is true, but if a bit is not set, it might or might not be true.
+ * The visibility map is a bitmap with two bits (all-visible and all-frozen)
+ * per heap page. A set all-visible bit means that all tuples on the page are
+ * known visible to all transactions, and therefore the page doesn't need to
+ * be vacuumed. A set all-frozen bit means that all tuples on the page are
+ * completely frozen, and therefore the page doesn't need to be vacuumed even
+ * if whole table scanning vacuum is required (e.g. anti-wraparound vacuum).
+ * The all-frozen bit must be set only when the page is already all-visible.
*
- * Clearing a visibility map bit is not separately WAL-logged. The callers
+ * The map is conservative in the sense that we make sure that whenever a bit
+ * is set, we know the condition is true, but if a bit is not set, it might or
+ * might not be true.
+ *
+ * Clearing both visibility map bits is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
* replay of the updating operation as well.
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is visible to all
- * transactions; we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
- * on the page itself, and the visibility map bit. If a crash occurs after the
- * visibility map page makes it to disk and before the updated heap page makes
- * it to disk, redo must set the bit on the heap page. Otherwise, the next
- * insert, update, or delete on the heap page will fail to realize that the
- * visibility map bit must be cleared, possibly causing index-only scans to
- * return wrong answers.
+ * it may still be the case that every tuple on the page is all-visible or
+ * all-frozen; we just don't know that for certain. The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE
+ * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility map
+ * bit. If a crash occurs after the visibility map page makes it to disk and before
+ * the updated heap page makes it to disk, redo must set the bit on the heap page.
+ * Otherwise, the next insert, update, or delete on the heap page will fail to
+ * realize that the visibility map bit must be cleared, possibly causing index-only
+ * scans to return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
- * The visibility map is not used for anti-wraparound vacuums, because
- * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
- * present in the table, even on pages that don't have any dead tuples.
+ * The visibility map has the all-frozen bit which indicates all tuples on
+ * corresponding page have been completely frozen, so the visibility map is also
+ * used for anti-wraparound vacuum, even if freezing of tuples is required.
*
* LOCKING
*
@@ -101,38 +107,50 @@
*/
#define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
-/* Number of bits allocated for each heap block. */
-#define BITS_PER_HEAPBLOCK 1
-
-/* Number of heap blocks we can represent in one byte. */
-#define HEAPBLOCKS_PER_BYTE 8
-
/* Number of heap blocks we can represent in one visibility map page. */
#define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE)
/* Mapping from heap block number to the right bit in the visibility map */
#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
-#define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
-
-/* table for fast counting of set bits */
-static const uint8 number_of_ones[256] = {
- 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
- 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+#define HEAPBLK_TO_MAPBIT(x) (((x) % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK)
+
+/* tables for fast counting of set bits for visible and frozen */
+static const uint8 number_of_ones_for_visible[256] = {
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 0, 1, 0, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 2, 1, 2,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4,
+ 1, 2, 1, 2, 2, 3, 2, 3, 1, 2, 1, 2, 2, 3, 2, 3,
+ 2, 3, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 4, 3, 4
+};
+static const uint8 number_of_ones_for_frozen[256] = {
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4,
+ 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4
};
/* prototypes for internal routines */
@@ -141,7 +159,7 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
/*
- * visibilitymap_clear - clear a bit in visibility map
+ * visibilitymap_clear - clear all bits in visibility map
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -153,7 +171,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- uint8 mask = 1 << mapBit;
+ uint8 mask = VISIBILITYMAP_VALID_BITS << mapBit;
char *map;
#ifdef TRACE_VISIBILITYMAP
@@ -186,7 +204,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
* visibilitymap_set to actually set the bit.
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk.
*
@@ -212,7 +230,7 @@ visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
* visibilitymap_pin_ok - do we already have the correct page pinned?
*
* On entry, buf should be InvalidBuffer or a valid buffer returned by
- * an earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * an earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. The return value indicates whether the buffer covers the
* given heapBlk.
*/
@@ -225,7 +243,7 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
}
/*
- * visibilitymap_set - set a bit on a previously pinned page
+ * visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
* or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
@@ -234,10 +252,11 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* marked all-visible; it is needed for Hot Standby, and can be
* InvalidTransactionId if the page contains no tuples.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
- * this function. Except in recovery, caller should also pass the heap
- * buffer. When checksums are enabled and we're not in recovery, we must add
- * the heap buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
+ * bit before calling this function. Except in recovery, caller should also
+ * pass the heap buffer and flags indicating which bit(s) we want to set.
+ * When checksums are enabled and we're not in recovery, we must add the heap
+ * buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -245,13 +264,14 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
*/
void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid)
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
Page page;
- char *map;
+ uint8 *map;
#ifdef TRACE_VISIBILITYMAP
elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
@@ -259,6 +279,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
Assert(InRecovery || BufferIsValid(heapBuf));
+ Assert(flags & VISIBILITYMAP_VALID_BITS);
/* Check that we have the right heap page pinned, if present */
if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -269,14 +290,14 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
elog(ERROR, "wrong VM buffer passed to visibilitymap_set");
page = BufferGetPage(vmBuf);
- map = PageGetContents(page);
+ map = (uint8 *)PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (!(map[mapByte] & (1 << mapBit)))
+ if (flags != (map[mapByte] >> mapBit & VISIBILITYMAP_VALID_BITS))
{
START_CRIT_SECTION();
- map[mapByte] |= (1 << mapBit);
+ map[mapByte] |= (flags << mapBit);
MarkBufferDirty(vmBuf);
if (RelationNeedsWAL(rel))
@@ -285,7 +306,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Assert(!InRecovery);
recptr = log_heap_visible(rel->rd_node, heapBuf, vmBuf,
- cutoff_xid);
+ cutoff_xid, flags);
/*
* If data checksums are enabled (or wal_log_hints=on), we
@@ -295,11 +316,19 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* caller is expected to set PD_ALL_VISIBLE first */
- Assert(PageIsAllVisible(heapPage));
+ /*
+ * Caller is expected to set PD_ALL_VISIBLE or
+ * PD_ALL_FROZEN first.
+ */
+ if (flags & VISIBILITYMAP_ALL_VISIBLE)
+ Assert(PageIsAllVisible(heapPage));
+ if (flags & VISIBILITYMAP_ALL_FROZEN)
+ Assert(PageIsAllFrozen(heapPage));
+
PageSetLSN(heapPage, recptr);
}
}
+
PageSetLSN(page, recptr);
}
@@ -310,15 +339,17 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
}
/*
- * visibilitymap_test - test if a bit is set
+ * visibilitymap_get_status - get status of bits
*
- * Are all tuples on heapBlk visible to all, according to the visibility map?
+ * Are all tuples on heapBlk visible to all or are marked frozen, according
+ * to the visibility map?
*
* On entry, *buf should be InvalidBuffer or a valid buffer returned by an
- * earlier call to visibilitymap_pin or visibilitymap_test on the same
+ * earlier call to visibilitymap_pin or visibilitymap_get_status on the same
* relation. On return, *buf is a valid buffer with the map page containing
* the bit for heapBlk, or InvalidBuffer. The caller is responsible for
- * releasing *buf after it's done testing and setting bits.
+ * releasing *buf after it's done testing and setting bits.  The caller can
+ * mask the returned status with VISIBILITYMAP_ALL_VISIBLE or VISIBILITYMAP_ALL_FROZEN.
*
* NOTE: This function is typically called without a lock on the heap page,
* so somebody else could change the bit just after we look at it. In fact,
@@ -327,17 +358,16 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
* we might see the old value. It is the caller's responsibility to deal with
* all concurrency issues!
*/
-bool
-visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+uint8
+visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
{
BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
- bool result;
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_get_status %s %d", RelationGetRelationName(rel), heapBlk);
#endif
/* Reuse the old pinned buffer if possible */
@@ -360,13 +390,11 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
map = PageGetContents(BufferGetPage(*buf));
/*
- * A single-bit read is atomic. There could be memory-ordering effects
+ * The two-bit read is atomic (a single byte read). There could be memory-ordering effects
* here, but for performance reasons we make it the caller's job to worry
* about that.
*/
- result = (map[mapByte] & (1 << mapBit)) ? true : false;
-
- return result;
+ return ((map[mapByte] >> mapBit) & VISIBILITYMAP_VALID_BITS);
}
/*
@@ -374,14 +402,20 @@ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
*
* Note: we ignore the possibility of race conditions when the table is being
* extended concurrently with the call. New pages added to the table aren't
- * going to be marked all-visible, so they won't affect the result.
+ * going to be marked all-visible or all-frozen, so they won't affect the result.
*/
-BlockNumber
-visibilitymap_count(Relation rel)
+void
+visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)
{
- BlockNumber result = 0;
BlockNumber mapBlock;
+ /* all_visible must be specified */
+ Assert(all_visible);
+
+ *all_visible = 0;
+ if (all_frozen)
+ *all_frozen = 0;
+
for (mapBlock = 0;; mapBlock++)
{
Buffer mapBuffer;
@@ -406,13 +440,13 @@ visibilitymap_count(Relation rel)
for (i = 0; i < MAPSIZE; i++)
{
- result += number_of_ones[map[i]];
+ *all_visible += number_of_ones_for_visible[map[i]];
+ if (all_frozen)
+ *all_frozen += number_of_ones_for_frozen[map[i]];
}
ReleaseBuffer(mapBuffer);
}
-
- return result;
}
/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8898b55..31a1438 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1920,7 +1920,7 @@ index_update_stats(Relation rel,
BlockNumber relallvisible;
if (rd_rel->relkind != RELKIND_INDEX)
- relallvisible = visibilitymap_count(rel);
+ visibilitymap_count(rel, &relallvisible, NULL);
else /* don't bother for indexes */
relallvisible = 0;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index abf9a70..37d35d7 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -458,6 +458,7 @@ CREATE VIEW pg_stat_all_tables AS
pg_stat_get_live_tuples(C.oid) AS n_live_tup,
pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(C.oid) AS n_frozen_pages,
pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid) as last_analyze,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 070df29..d7f3035 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -314,6 +314,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ BlockNumber relallvisible,
+ relallfrozen;
if (inh)
ereport(elevel,
@@ -564,51 +566,56 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
}
}
- /*
- * Update pages/tuples stats in pg_class ... but not if we're doing
- * inherited stats.
- */
if (!inh)
+ {
+ /* Count the all-visible and all-frozen bits in the visibility map */
+ visibilitymap_count(onerel, &relallvisible, &relallfrozen);
+
+ /*
+ * Update pages/tuples stats in pg_class ... but not if we're doing
+ * inherited stats.
+ */
vac_update_relstats(onerel,
relpages,
totalrows,
- visibilitymap_count(onerel),
+ relallvisible,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
in_outer_xact);
- /*
- * Same for indexes. Vacuum always scans all indexes, so if we're part of
- * VACUUM ANALYZE, don't overwrite the accurate count already inserted by
- * VACUUM.
- */
- if (!inh && !(options & VACOPT_VACUUM))
- {
- for (ind = 0; ind < nindexes; ind++)
+ /*
+ * Same for indexes. Vacuum always scans all indexes, so if we're part of
+ * VACUUM ANALYZE, don't overwrite the accurate count already inserted by
+ * VACUUM.
+ */
+ if (!(options & VACOPT_VACUUM))
{
- AnlIndexData *thisdata = &indexdata[ind];
- double totalindexrows;
-
- totalindexrows = ceil(thisdata->tupleFract * totalrows);
- vac_update_relstats(Irel[ind],
- RelationGetNumberOfBlocks(Irel[ind]),
- totalindexrows,
- 0,
- false,
- InvalidTransactionId,
- InvalidMultiXactId,
- in_outer_xact);
+ for (ind = 0; ind < nindexes; ind++)
+ {
+ AnlIndexData *thisdata = &indexdata[ind];
+ double totalindexrows;
+
+ totalindexrows = ceil(thisdata->tupleFract * totalrows);
+ vac_update_relstats(Irel[ind],
+ RelationGetNumberOfBlocks(Irel[ind]),
+ totalindexrows,
+ 0,
+ false,
+ InvalidTransactionId,
+ InvalidMultiXactId,
+ in_outer_xact);
+ }
}
- }
- /*
- * Report ANALYZE to the stats collector, too. However, if doing
- * inherited stats we shouldn't report, because the stats collector only
- * tracks per-table stats.
- */
- if (!inh)
- pgstat_report_analyze(onerel, totalrows, totaldeadrows);
+ /*
+ * Report ANALYZE to the stats collector, too. However, if doing
+ * inherited stats we shouldn't report, because the stats collector only
+ * tracks per-table stats.
+ */
+ pgstat_report_analyze(onerel, totalrows, totaldeadrows, relallfrozen);
+
+ }
/* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
if (!(options & VACOPT_VACUUM))
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 4f6f6e7..60782da 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -188,7 +188,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
double new_rel_tuples;
- BlockNumber new_rel_allvisible;
+ BlockNumber new_rel_allvisible,
+ new_rel_allfrozen;
double new_live_tuples;
TransactionId new_frozen_xid;
MultiXactId new_min_multi;
@@ -295,10 +296,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
new_rel_tuples = vacrelstats->old_rel_tuples;
}
- new_rel_allvisible = visibilitymap_count(onerel);
+ visibilitymap_count(onerel, &new_rel_allvisible, &new_rel_allfrozen);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
+ if (new_rel_allfrozen > new_rel_pages)
+ new_rel_allfrozen = new_rel_pages;
+
new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
@@ -319,7 +323,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),
onerel->rd_rel->relisshared,
new_live_tuples,
- vacrelstats->new_dead_tuples);
+ vacrelstats->new_dead_tuples,
+ new_rel_allfrozen);
/* and log the action if appropriate */
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
@@ -518,7 +523,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -554,8 +559,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
next_not_all_visible_block < nblocks;
next_not_all_visible_block++)
{
- if (!visibilitymap_test(onerel, next_not_all_visible_block,
- &vmbuffer))
+ if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
break;
vacuum_delay_point();
}
@@ -767,7 +771,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId);
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE);
END_CRIT_SECTION();
}
@@ -1034,7 +1039,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
PageSetAllVisible(page);
MarkBufferDirty(buf);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid);
+ vmbuffer, visibility_cutoff_xid,
+ VISIBILITYMAP_ALL_VISIBLE);
}
/*
@@ -1045,7 +1051,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* that something bad has happened.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
- && visibilitymap_test(onerel, blkno, &vmbuffer))
+ && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
@@ -1316,11 +1322,11 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* flag is now set, also set the VM bit.
*/
if (PageIsAllVisible(page) &&
- !visibilitymap_test(onerel, blkno, vmbuffer))
+ !VM_ALL_VISIBLE(onerel, blkno, vmbuffer))
{
Assert(BufferIsValid(*vmbuffer));
visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid);
+ visibility_cutoff_xid, VISIBILITYMAP_ALL_VISIBLE);
}
return tupindex;
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 90afbdc..4f6f91c 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -85,9 +85,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
* which all tuples are known visible to everybody. In any case,
* we'll use the index tuple not the heap tuple as the data source.
*
- * Note on Memory Ordering Effects: visibilitymap_test does not lock
- * the visibility map buffer, and therefore the result we read here
- * could be slightly stale. However, it can't be stale enough to
+ * Note on Memory Ordering Effects: visibilitymap_get_status does not
+ * lock the visibility map buffer, and therefore the result we read
+ * here could be slightly stale. However, it can't be stale enough to
* matter.
*
* We need to detect clearing a VM bit due to an insert right away,
@@ -114,9 +114,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
* It's worth going through this complexity to avoid needing to lock
* the VM buffer, which could cause significant contention.
*/
- if (!visibilitymap_test(scandesc->heapRelation,
- ItemPointerGetBlockNumber(tid),
- &node->ioss_VMBuffer))
+ if (!VM_ALL_VISIBLE(scandesc->heapRelation,
+ ItemPointerGetBlockNumber(tid),
+ &node->ioss_VMBuffer))
{
/*
* Rats, we have to visit the heap to check visibility.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index da768c6..08b61cb 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1329,7 +1329,8 @@ pgstat_report_autovac(Oid dboid)
*/
void
pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgVacuum msg;
@@ -1343,6 +1344,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -1354,7 +1356,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
*/
void
pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples)
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages)
{
PgStat_MsgAnalyze msg;
@@ -1394,6 +1397,7 @@ pgstat_report_analyze(Relation rel,
msg.m_analyzetime = GetCurrentTimestamp();
msg.m_live_tuples = livetuples;
msg.m_dead_tuples = deadtuples;
+ msg.m_frozen_pages = frozenpages;
pgstat_send(&msg, sizeof(msg));
}
@@ -3702,6 +3706,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->n_live_tuples = 0;
result->n_dead_tuples = 0;
result->changes_since_analyze = 0;
+ result->n_frozen_pages = 0;
result->blocks_fetched = 0;
result->blocks_hit = 0;
result->vacuum_timestamp = 0;
@@ -5069,6 +5074,7 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
if (msg->m_autovacuum)
{
@@ -5103,6 +5109,7 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
tabentry->n_live_tuples = msg->m_live_tuples;
tabentry->n_dead_tuples = msg->m_dead_tuples;
+ tabentry->n_frozen_pages = msg->m_frozen_pages;
/*
* We reset changes_since_analyze to zero, forgetting any changes that
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 1b22fcc..7c57b3e 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -46,6 +46,7 @@ extern Datum pg_stat_get_vacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autovacuum_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_analyze_count(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_frozen_pages(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_calls(PG_FUNCTION_ARGS);
extern Datum pg_stat_get_function_total_time(PG_FUNCTION_ARGS);
@@ -450,6 +451,21 @@ pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS)
}
Datum
+pg_stat_get_frozen_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int32 result;
+ PgStat_StatTabEntry *tabentry;
+
+ if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+ result = 0;
+ else
+ result = (int32) (tabentry->n_frozen_pages);
+
+ PG_RETURN_INT32(result);
+}
+
+Datum
pg_stat_get_function_calls(PG_FUNCTION_ARGS)
{
Oid funcid = PG_GETARG_OID(0);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index f77489b..5fcb539 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -320,9 +320,10 @@ typedef struct xl_heap_freeze_page
typedef struct xl_heap_visible
{
TransactionId cutoff_xid;
+ uint8 flags;
} xl_heap_visible;
-#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+#define SizeOfHeapVisible (offsetof(xl_heap_visible, flags) + sizeof(uint8))
typedef struct xl_heap_new_cid
{
@@ -389,6 +390,6 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid);
+ Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index d447daf..a75de5c 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -19,15 +19,36 @@
#include "storage/buf.h"
#include "utils/relcache.h"
+/*
+ * Number of bits allocated for each heap block.
+ * One for all-visible, the other for all-frozen.
+ */
+#define BITS_PER_HEAPBLOCK 2
+
+/* Number of heap blocks we can represent in one byte. */
+#define HEAPBLOCKS_PER_BYTE 4
+
+/* Flags for bit map */
+#define VISIBILITYMAP_ALL_VISIBLE 0x01
+#define VISIBILITYMAP_ALL_FROZEN 0x02
+#define VISIBILITYMAP_VALID_BITS 0x03 /* OR of all valid visibilitymap flag bits */
+
+/* Macros for visibilitymap test */
+#define VM_ALL_VISIBLE(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_VISIBLE) != 0)
+#define VM_ALL_FROZEN(r, b, v) \
+ ((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
+
extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk,
Buffer vmbuf);
extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
Buffer *vmbuf);
extern bool visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf);
extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
- XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid);
-extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
-extern BlockNumber visibilitymap_count(Relation rel);
+ XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
+ uint8 flags);
+extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
#endif /* VISIBILITYMAP_H */
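Given BITS_PER_HEAPBLOCK = 2 and HEAPBLOCKS_PER_BYTE = 4, the status of a heap
block inside one map page sits at byte (block / 4), shifted left by
2 * (block % 4). A minimal sketch of that addressing, assuming the
HEAPBLK_TO_MAPBYTE/HEAPBLK_TO_MAPBIT macros (not shown in this excerpt) are
updated to match; the hypothetical helper below is for illustration only:

    /* Status bits of a block, relative to the start of one map page's contents */
    static inline uint8
    vm_status_in_page(const uint8 *map, uint32 block_in_page)
    {
        uint32  byteno = block_in_page / HEAPBLOCKS_PER_BYTE;
        uint32  shift = BITS_PER_HEAPBLOCK * (block_in_page % HEAPBLOCKS_PER_BYTE);

        return (map[byteno] >> shift) & VISIBILITYMAP_VALID_BITS;
    }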
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index b4131f9..102d9eb 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 201602171
+#define CATALOG_VERSION_NO 201602181
#endif
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 2222e8f..1a6ce12 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2706,6 +2706,8 @@ DATA(insert OID = 3056 ( pg_stat_get_analyze_count PGNSP PGUID 12 1 0 0 0 f f f
DESCR("statistics: number of manual analyzes for a table");
DATA(insert OID = 3057 ( pg_stat_get_autoanalyze_count PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_autoanalyze_count _null_ _null_ _null_ ));
DESCR("statistics: number of auto analyzes for a table");
+DATA(insert OID = 6015 ( pg_stat_get_frozen_pages PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_frozen_pages _null_ _null_ _null_ ));
+DESCR("statistics: number of frozen pages of table");
DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 0 f f f f t t s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_idset _null_ _null_ _null_ ));
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f f f t s r 1 0 2249 "23" "{23,26,23,26,25,25,25,16,1184,1184,1184,1184,869,25,23,28,28,16,25,25,23,16,25}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,pid,usesysid,application_name,state,query,waiting,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,ssl,sslversion,sslcipher,sslbits,sslcompression,sslclientdn}" _null_ _null_ pg_stat_get_activity _null_ _null_ _null_ ));
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 65e968e..ad40b70 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -17,6 +17,7 @@
#include "portability/instr_time.h"
#include "postmaster/pgarch.h"
#include "storage/barrier.h"
+#include "storage/block.h"
#include "utils/hsearch.h"
#include "utils/relcache.h"
@@ -355,6 +356,7 @@ typedef struct PgStat_MsgVacuum
TimestampTz m_vacuumtime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ BlockNumber m_frozen_pages;
} PgStat_MsgVacuum;
@@ -372,6 +374,7 @@ typedef struct PgStat_MsgAnalyze
TimestampTz m_analyzetime;
PgStat_Counter m_live_tuples;
PgStat_Counter m_dead_tuples;
+ int32 m_frozen_pages;
} PgStat_MsgAnalyze;
@@ -551,7 +554,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9E
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -615,6 +618,8 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter n_dead_tuples;
PgStat_Counter changes_since_analyze;
+ int32 n_frozen_pages;
+
PgStat_Counter blocks_fetched;
PgStat_Counter blocks_hit;
@@ -917,9 +922,11 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type t
extern void pgstat_report_autovac(Oid dboid);
extern void pgstat_report_vacuum(Oid tableoid, bool shared,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_analyze(Relation rel,
- PgStat_Counter livetuples, PgStat_Counter deadtuples);
+ PgStat_Counter livetuples, PgStat_Counter deadtuples,
+ int32 frozenpages);
extern void pgstat_report_recovery_conflict(int reason);
extern void pgstat_report_deadlock(void);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 2ce3be7..0b023b3 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,10 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
+#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
+ frozen */
-#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
+#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
/*
* Page layout version number 0 is for pre-7.3 Postgres releases.
@@ -367,7 +369,12 @@ typedef PageHeaderData *PageHeader;
#define PageSetAllVisible(page) \
(((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
#define PageClearAllVisible(page) \
- (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+ (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
+
+#define PageIsAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
+#define PageSetAllFrozen(page) \
+ (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
#define PageIsPrunable(page, oldestxmin) \
( \
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 81bc5c9..a655519 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1686,6 +1686,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,
pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
+ pg_stat_get_frozen_pages(c.oid) AS n_frozen_pages,
pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid) AS last_autovacuum,
pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
@@ -1798,6 +1799,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_pages,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
@@ -1841,6 +1843,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,
pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,
pg_stat_all_tables.n_mod_since_analyze,
+ pg_stat_all_tables.n_frozen_pages,
pg_stat_all_tables.last_vacuum,
pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
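With the view changes above, the new counter is reachable from plain SQL; for
instance, a query like the following (using only columns defined in the
patched pg_stat_user_tables view) shows which tables vacuum or analyze has
already reported frozen pages for:

    SELECT relname, n_frozen_pages, last_vacuum, last_autovacuum
      FROM pg_stat_user_tables
     ORDER BY n_frozen_pages DESC;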
001_remove_pageconversion_pg_upgrade_v36.patch (binary/octet-stream)
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index d9c8145..0c882d9 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -8,7 +8,7 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
OBJS = check.o controldata.o dump.o exec.o file.o function.o info.o \
- option.o page.o parallel.o pg_upgrade.o relfilenode.o server.o \
+ option.o parallel.o pg_upgrade.o relfilenode.o server.o \
tablespace.o util.o version.o $(WIN32RES)
override CPPFLAGS := -DDLSUFFIX=\"$(DLSUFFIX)\" -I$(srcdir) -I$(libpq_srcdir) $(CPPFLAGS)
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 8c034bc..86d088d 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -80,8 +80,6 @@ check_and_dump_old_cluster(bool live_check)
if (!live_check)
start_postmaster(&old_cluster, true);
- get_pg_database_relfilenode(&old_cluster);
-
/* Extract a list of databases and tables from the old cluster */
get_db_and_rel_infos(&old_cluster);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 9357ad8..115d506 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -25,15 +25,11 @@ static int win32_pghardlink(const char *src, const char *dst);
/*
* copyAndUpdateFile()
*
- * Copies a relation file from src to dst. If pageConverter is non-NULL, this function
- * uses that pageConverter to do a page-by-page conversion.
+ * Copies a relation file from src to dst.
*/
const char *
-copyAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst, bool force)
+copyAndUpdateFile(const char *src, const char *dst, bool force)
{
- if (pageConverter == NULL)
- {
#ifndef WIN32
if (copy_file(src, dst, force) == -1)
#else
@@ -42,65 +38,6 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
return getErrorText();
else
return NULL;
- }
- else
- {
- /*
- * We have a pageConverter object - that implies that the
- * PageLayoutVersion differs between the two clusters so we have to
- * perform a page-by-page conversion.
- *
- * If the pageConverter can convert the entire file at once, invoke
- * that plugin function, otherwise, read each page in the relation
- * file and call the convertPage plugin function.
- */
-
-#ifdef PAGE_CONVERSION
- if (pageConverter->convertFile)
- return pageConverter->convertFile(pageConverter->pluginData,
- dst, src);
- else
-#endif
- {
- int src_fd;
- int dstfd;
- char buf[BLCKSZ];
- ssize_t bytesRead;
- const char *msg = NULL;
-
- if ((src_fd = open(src, O_RDONLY, 0)) < 0)
- return "could not open source file";
-
- if ((dstfd = open(dst, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR)) < 0)
- {
- close(src_fd);
- return "could not create destination file";
- }
-
- while ((bytesRead = read(src_fd, buf, BLCKSZ)) == BLCKSZ)
- {
-#ifdef PAGE_CONVERSION
- if ((msg = pageConverter->convertPage(pageConverter->pluginData, buf, buf)) != NULL)
- break;
-#endif
- if (write(dstfd, buf, BLCKSZ) != BLCKSZ)
- {
- msg = "could not write new page to destination";
- break;
- }
- }
-
- close(src_fd);
- close(dstfd);
-
- if (msg)
- return msg;
- else if (bytesRead != 0)
- return "found partial page in source file";
- else
- return NULL;
- }
- }
}
@@ -114,12 +51,8 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
* instead of copying the data from the old cluster to the new cluster.
*/
const char *
-linkAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst)
+linkAndUpdateFile(const char *src, const char *dst)
{
- if (pageConverter != NULL)
- return "Cannot in-place update this cluster, page-by-page conversion is required";
-
if (pg_link_file(src, dst) == -1)
return getErrorText();
else
diff --git a/src/bin/pg_upgrade/page.c b/src/bin/pg_upgrade/page.c
deleted file mode 100644
index e5686e5..0000000
--- a/src/bin/pg_upgrade/page.c
+++ /dev/null
@@ -1,164 +0,0 @@
-/*
- * page.c
- *
- * per-page conversion operations
- *
- * Copyright (c) 2010-2016, PostgreSQL Global Development Group
- * src/bin/pg_upgrade/page.c
- */
-
-#include "postgres_fe.h"
-
-#include "pg_upgrade.h"
-
-#include "storage/bufpage.h"
-
-
-#ifdef PAGE_CONVERSION
-
-
-static void getPageVersion(
- uint16 *version, const char *pathName);
-static pageCnvCtx *loadConverterPlugin(
- uint16 newPageVersion, uint16 oldPageVersion);
-
-
-/*
- * setupPageConverter()
- *
- * This function determines the PageLayoutVersion of the old cluster and
- * the PageLayoutVersion of the new cluster. If the versions differ, this
- * function loads a converter plugin and returns a pointer to a pageCnvCtx
- * object (in *result) that knows how to convert pages from the old format
- * to the new format. If the versions are identical, this function just
- * returns a NULL pageCnvCtx pointer to indicate that page-by-page conversion
- * is not required.
- */
-pageCnvCtx *
-setupPageConverter(void)
-{
- uint16 oldPageVersion;
- uint16 newPageVersion;
- pageCnvCtx *converter;
- const char *msg;
- char dstName[MAXPGPATH];
- char srcName[MAXPGPATH];
-
- snprintf(dstName, sizeof(dstName), "%s/global/%u", new_cluster.pgdata,
- new_cluster.pg_database_oid);
- snprintf(srcName, sizeof(srcName), "%s/global/%u", old_cluster.pgdata,
- old_cluster.pg_database_oid);
-
- getPageVersion(&oldPageVersion, srcName);
- getPageVersion(&newPageVersion, dstName);
-
- /*
- * If the old cluster and new cluster use the same page layouts, then we
- * don't need a page converter.
- */
- if (newPageVersion != oldPageVersion)
- {
- /*
- * The clusters use differing page layouts, see if we can find a
- * plugin that knows how to convert from the old page layout to the
- * new page layout.
- */
-
- if ((converter = loadConverterPlugin(newPageVersion, oldPageVersion)) == NULL)
- pg_fatal("could not find plugin to convert from old page layout to new page layout\n");
-
- return converter;
- }
- else
- return NULL;
-}
-
-
-/*
- * getPageVersion()
- *
- * Retrieves the PageLayoutVersion for the given relation.
- *
- * Returns NULL on success (and stores the PageLayoutVersion at *version),
- * if an error occurs, this function returns an error message (in the form
- * of a null-terminated string).
- */
-static void
-getPageVersion(uint16 *version, const char *pathName)
-{
- int relfd;
- PageHeaderData page;
- ssize_t bytesRead;
-
- if ((relfd = open(pathName, O_RDONLY, 0)) < 0)
- pg_fatal("could not open relation %s\n", pathName);
-
- if ((bytesRead = read(relfd, &page, sizeof(page))) != sizeof(page))
- pg_fatal("could not read page header of %s\n", pathName);
-
- *version = PageGetPageLayoutVersion(&page);
-
- close(relfd);
-
- return;
-}
-
-
-/*
- * loadConverterPlugin()
- *
- * This function loads a page-converter plugin library and grabs a
- * pointer to each of the (interesting) functions provided by that
- * plugin. The name of the plugin library is derived from the given
- * newPageVersion and oldPageVersion. If a plugin is found, this
- * function returns a pointer to a pageCnvCtx object (which will contain
- * a collection of plugin function pointers). If the required plugin
- * is not found, this function returns NULL.
- */
-static pageCnvCtx *
-loadConverterPlugin(uint16 newPageVersion, uint16 oldPageVersion)
-{
- char pluginName[MAXPGPATH];
- void *plugin;
-
- /*
- * Try to find a plugin that can convert pages of oldPageVersion into
- * pages of newPageVersion. For example, if we oldPageVersion = 3 and
- * newPageVersion is 4, we search for a plugin named:
- * plugins/convertLayout_3_to_4.dll
- */
-
- /*
- * FIXME: we are searching for plugins relative to the current directory,
- * we should really search relative to our own executable instead.
- */
- snprintf(pluginName, sizeof(pluginName), "./plugins/convertLayout_%d_to_%d%s",
- oldPageVersion, newPageVersion, DLSUFFIX);
-
- if ((plugin = pg_dlopen(pluginName)) == NULL)
- return NULL;
- else
- {
- pageCnvCtx *result = (pageCnvCtx *) pg_malloc(sizeof(*result));
-
- result->old.PageVersion = oldPageVersion;
- result->new.PageVersion = newPageVersion;
-
- result->startup = (pluginStartup) pg_dlsym(plugin, "init");
- result->convertFile = (pluginConvertFile) pg_dlsym(plugin, "convertFile");
- result->convertPage = (pluginConvertPage) pg_dlsym(plugin, "convertPage");
- result->shutdown = (pluginShutdown) pg_dlsym(plugin, "fini");
- result->pluginData = NULL;
-
- /*
- * If the plugin has exported an initializer, go ahead and invoke it.
- */
- if (result->startup)
- result->startup(MIGRATOR_API_VERSION, &result->pluginVersion,
- newPageVersion, oldPageVersion, &result->pluginData);
-
- return result;
- }
-}
-
-#endif
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 984c395..4f5361a 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -260,8 +260,6 @@ prepare_new_cluster(void)
new_cluster.bindir, cluster_conn_opts(&new_cluster),
log_opts.verbose ? "--verbose" : "");
check_ok();
-
- get_pg_database_relfilenode(&new_cluster);
}
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index bc733c4..900b2a7 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -269,7 +269,6 @@ typedef struct
uint32 major_version; /* PG_VERSION of cluster */
char major_version_str[64]; /* string PG_VERSION of cluster */
uint32 bin_version; /* version returned from pg_ctl */
- Oid pg_database_oid; /* OID of pg_database relation */
const char *tablespace_suffix; /* directory specification */
} ClusterInfo;
@@ -364,40 +363,8 @@ bool pid_lock_file_exists(const char *datadir);
/* file.c */
-#ifdef PAGE_CONVERSION
-typedef const char *(*pluginStartup) (uint16 migratorVersion,
- uint16 *pluginVersion, uint16 newPageVersion,
- uint16 oldPageVersion, void **pluginData);
-typedef const char *(*pluginConvertFile) (void *pluginData,
- const char *dstName, const char *srcName);
-typedef const char *(*pluginConvertPage) (void *pluginData,
- const char *dstPage, const char *srcPage);
-typedef const char *(*pluginShutdown) (void *pluginData);
-
-typedef struct
-{
- uint16 oldPageVersion; /* Page layout version of the old cluster */
- uint16 newPageVersion; /* Page layout version of the new cluster */
- uint16 pluginVersion; /* API version of converter plugin */
- void *pluginData; /* Plugin data (set by plugin) */
- pluginStartup startup; /* Pointer to plugin's startup function */
- pluginConvertFile convertFile; /* Pointer to plugin's file converter
- * function */
- pluginConvertPage convertPage; /* Pointer to plugin's page converter
- * function */
- pluginShutdown shutdown; /* Pointer to plugin's shutdown function */
-} pageCnvCtx;
-
-const pageCnvCtx *setupPageConverter(void);
-#else
-/* dummy */
-typedef void *pageCnvCtx;
-#endif
-
-const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst, bool force);
-const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst);
+const char *copyAndUpdateFile(const char *src, const char *dst, bool force);
+const char *linkAndUpdateFile(const char *src, const char *dst);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c059c5b..fcaad79 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -15,10 +15,8 @@
#include "access/transam.h"
-static void transfer_single_new_db(pageCnvCtx *pageConverter,
- FileNameMap *maps, int size, char *old_tablespace);
-static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
+static void transfer_relfile(FileNameMap *map, const char *suffix);
/*
@@ -92,7 +90,6 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
*new_db = NULL;
FileNameMap *mappings;
int n_maps;
- pageCnvCtx *pageConverter = NULL;
/*
* Advance past any databases that exist in the new cluster but not in
@@ -116,11 +113,7 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
{
print_maps(mappings, n_maps, new_db->db_name);
-#ifdef PAGE_CONVERSION
- pageConverter = setupPageConverter();
-#endif
- transfer_single_new_db(pageConverter, mappings, n_maps,
- old_tablespace);
+ transfer_single_new_db(mappings, n_maps, old_tablespace);
}
/* We allocate something even for n_maps == 0 */
pg_free(mappings);
@@ -129,45 +122,13 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
return;
}
-
-/*
- * get_pg_database_relfilenode()
- *
- * Retrieves the relfilenode for a few system-catalog tables. We need these
- * relfilenodes later in the upgrade process.
- */
-void
-get_pg_database_relfilenode(ClusterInfo *cluster)
-{
- PGconn *conn = connectToServer(cluster, "template1");
- PGresult *res;
- int i_relfile;
-
- res = executeQueryOrDie(conn,
- "SELECT c.relname, c.relfilenode "
- "FROM pg_catalog.pg_class c, "
- " pg_catalog.pg_namespace n "
- "WHERE c.relnamespace = n.oid AND "
- " n.nspname = 'pg_catalog' AND "
- " c.relname = 'pg_database' "
- "ORDER BY c.relname");
-
- i_relfile = PQfnumber(res, "relfilenode");
- cluster->pg_database_oid = atooid(PQgetvalue(res, 0, i_relfile));
-
- PQclear(res);
- PQfinish(conn);
-}
-
-
/*
* transfer_single_new_db()
*
* create links for mappings stored in "maps" array.
*/
static void
-transfer_single_new_db(pageCnvCtx *pageConverter,
- FileNameMap *maps, int size, char *old_tablespace)
+transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
{
int mapnum;
bool vm_crashsafe_match = true;
@@ -186,7 +147,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(&maps[mapnum], "");
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +155,9 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(&maps[mapnum], "_fsm");
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ transfer_relfile(&maps[mapnum], "_vm");
}
}
}
@@ -209,8 +170,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
* Copy or link file from old cluster to new one.
*/
static void
-transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+transfer_relfile(FileNameMap *map, const char *type_suffix)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -268,15 +228,11 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
/* Copying files might take some time, so give feedback. */
pg_log(PG_STATUS, "%s", old_file);
- if ((user_opts.transfer_mode == TRANSFER_MODE_LINK) && (pageConverter != NULL))
- pg_fatal("This upgrade requires page-by-page conversion, "
- "you must use copy mode instead of link mode.\n");
-
if (user_opts.transfer_mode == TRANSFER_MODE_COPY)
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ if ((msg = copyAndUpdateFile(old_file, new_file, true)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -284,7 +240,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
+ if ((msg = linkAndUpdateFile(old_file, new_file)) != NULL)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
002_support_pg_upgrade_for_freeze_map_v36.patch (binary/octet-stream)
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 115d506..9adee01 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -9,7 +9,11 @@
#include "postgres_fe.h"
+#include "access/visibilitymap.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
#include <fcntl.h>
@@ -21,6 +25,25 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+/* Lookup table for quickly rewriting a vm file to add the all-frozen bit */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
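Each entry of rewrite_vm_table maps one old-format byte (eight all-visible
bits, one per heap block) to the corresponding new-format 16-bit pattern: the
all-visible bit of block b moves to bit 2*b and the new all-frozen bit
(2*b + 1) starts out clear. A minimal sketch, under that assumption, of how
the table can be regenerated:

    #include <stdio.h>

    int
    main(void)
    {
        for (int b = 0; b < 256; b++)
        {
            unsigned int w = 0;

            for (int bit = 0; bit < 8; bit++)
                if (b & (1 << bit))
                    w |= 1u << (2 * bit);   /* old bit -> new all-visible position */

            printf("%5u%s", w, (b % 16 == 15) ? ",\n" : ", ");
        }
        return 0;
    }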
/*
* copyAndUpdateFile()
@@ -138,6 +161,95 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilityMap()
+ *
+ * Copies a visibility map file, widening each all-visible bit to two bits and
+ * leaving the new all-frozen bit clear (0).
+ */
+const char *
+rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
+{
+ int src_fd = 0;
+ int dst_fd = 0;
+ char buffer[BLCKSZ];
+ ssize_t bytesRead;
+ int rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+ BlockNumber blkno = 0;
+
+ /* Reset errno */
+ errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return getErrorText();
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ /* Perform data rewriting per page */
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
+ char *cur, *end, *blkend;
+ PageHeaderData pageheader;
+ uint16 vm_bits;
+
+ /* Save the page header data */
+ memcpy(&pageheader, buffer, SizeOfPageHeaderData);
+
+ cur = buffer + SizeOfPageHeaderData;
+ end = buffer + SizeOfPageHeaderData + rewriteVmBytesPerPage;
+ blkend = buffer + bytesRead;
+
+ while (blkend >= end)
+ {
+ char vmbuf[BLCKSZ];
+ char *vmtmp = vmbuf;
+
+ /* Copy page header in advance */
+ memcpy(vmbuf, &pageheader, SizeOfPageHeaderData);
+
+ vmtmp += SizeOfPageHeaderData;
+
+ /* Rewrite visibility map bit one by one */
+ while (end > cur)
+ {
+ /* Look up the rewritten two-bit-per-block pattern for this old byte */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+ memcpy(vmtmp, &vm_bits, BITS_PER_HEAPBLOCK);
+
+ cur++;
+ vmtmp += BITS_PER_HEAPBLOCK;
+ }
+
+ /* Set new checksum for the visibility map page, if enabled */
+ if (old_cluster.controldata.data_checksum_version != 0 &&
+ new_cluster.controldata.data_checksum_version != 0)
+ ((PageHeader) vmbuf)->pd_checksum = pg_checksum_page(vmbuf, blkno);
+
+ if (write(dst_fd, vmbuf, BLCKSZ) != BLCKSZ)
+ {
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ end += rewriteVmBytesPerPage;
+ blkno++;
+ }
+ }
+
+err:
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText();
+}
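Since every old-format byte expands to two bytes, each old visibility map page
becomes exactly two new pages: with the common 8 kB BLCKSZ and a 24-byte page
header, the 8168 payload bytes of one old page rewrite to 16336 bytes, i.e.
two new pages of 8168 payload bytes each. That is why rewriteVmBytesPerPage is
(BLCKSZ - SizeOfPageHeaderData) / 2 and the inner loop emits one output page
per half of each input page.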
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 900b2a7..ecd9ab3 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -110,6 +110,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The visibility map format changed with this 9.6 commit (all-frozen bit added).
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201602181
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -269,6 +273,7 @@ typedef struct
uint32 major_version; /* PG_VERSION of cluster */
char major_version_str[64]; /* string PG_VERSION of cluster */
uint32 bin_version; /* version returned from pg_ctl */
+ Oid pg_database_oid; /* OID of pg_database relation */
const char *tablespace_suffix; /* directory specification */
} ClusterInfo;
@@ -365,6 +370,8 @@ bool pid_lock_file_exists(const char *datadir);
const char *copyAndUpdateFile(const char *src, const char *dst, bool force);
const char *linkAndUpdateFile(const char *src, const char *dst);
+const char *rewriteVisibilityMap(const char *fromfile, const char *tofile,
+ bool force);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index fcaad79..b003b36 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -16,7 +16,7 @@
static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
-static void transfer_relfile(FileNameMap *map, const char *suffix);
+static void transfer_relfile(FileNameMap *map, const char *suffix, bool vm_need_rewrite);
/*
@@ -132,6 +132,7 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_need_rewrite = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -141,13 +142,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_need_rewrite = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(&maps[mapnum], "");
+ transfer_relfile(&maps[mapnum], "", vm_need_rewrite);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -155,9 +163,9 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(&maps[mapnum], "_fsm");
+ transfer_relfile(&maps[mapnum], "_fsm", vm_need_rewrite);
if (vm_crashsafe_match)
- transfer_relfile(&maps[mapnum], "_vm");
+ transfer_relfile(&maps[mapnum], "_vm", vm_need_rewrite);
}
}
}
@@ -168,9 +176,11 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
* transfer_relfile()
*
* Copy or link file from old cluster to new one.
+ * If vm_need_rewrite is true, the visibility map is rewritten to add the
+ * frozen bit, even in link mode.
*/
static void
-transfer_relfile(FileNameMap *map, const char *type_suffix)
+transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_need_rewrite)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -232,7 +242,13 @@ transfer_relfile(FileNameMap *map, const char *type_suffix)
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(old_file, new_file, true)) != NULL)
+ /* Rewrite visibility map */
+ if (vm_need_rewrite && (strcmp(type_suffix, "_vm") == 0))
+ msg = rewriteVisibilityMap(old_file, new_file, true);
+ else
+ msg = copyAndUpdateFile(old_file, new_file, true);
+
+ if (msg != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -240,7 +256,13 @@ transfer_relfile(FileNameMap *map, const char *type_suffix)
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkAndUpdateFile(old_file, new_file)) != NULL)
+ /* Rewrite the visibility map even in link mode */
+ if (vm_need_rewrite && (strcmp(type_suffix, "_vm") == 0))
+ msg = rewriteVisibilityMap(old_file, new_file, true);
+ else
+ msg = linkAndUpdateFile(old_file, new_file);
+
+ if (msg != NULL)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index ba79fb3..cd9b17e 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
003_optimize_vacuum_scan_based_on_freezemap_v36.patch (binary/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a09ceb2..8a258f0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5984,7 +5984,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -6028,7 +6028,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..7cc975d 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and which pages contain only frozen tuples.
+ This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -438,28 +438,25 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips pages that don't have any dead row
+ versions, and pages that have only frozen rows. To ensure all old
+ row versions have been frozen, a scan of all unfrozen pages is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: whole-table freezing is forced if
+ the table hasn't had all row versions frozen within the last
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transactions.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
- that, data loss could result. To ensure that this does not happen,
- autovacuum is invoked on any table that might contain unfrozen rows with
- XIDs older than the age specified by the configuration parameter <xref
- linkend="guc-autovacuum-freeze-max-age">. (This will happen even if
- autovacuum is disabled.)
+ the time <command>VACUUM</> last scanned unfrozen pages. If it were to go
+ unvacuumed for longer than that, data loss could result. To ensure
+ that this does not happen, autovacuum is invoked on any table that might
+ contain unfrozen rows with XIDs older than the age specified by the
+ configuration parameter <xref linkend="guc-autovacuum-freeze-max-age">.
+ (This will happen even if autovacuum is disabled.)
</para>
<para>
@@ -490,8 +487,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +523,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +551,18 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. Freezing of the whole table occurs once all pages of the relation
+ require it. In other cases, such as when <structfield>relfrozenxid</> is more
+ than <varname>vacuum_freeze_table_age</> transactions old or when
+ <command>VACUUM</>'s <literal>FREEZE</> option is used, <command>VACUUM</>
+ can skip pages on which all tuples are already marked as frozen.
+ When all pages of the table have eventually been marked as frozen by
+ <command>VACUUM</>, after it's finished <literal>age(relfrozenxid)</> should
+ be a little more than the <varname>vacuum_freeze_min_age</> setting that was
+ used (more by the number of transactions started since the <command>VACUUM</>
+ started). If <structfield>relfrozenxid</> has not been advanced by the time
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,28 +639,28 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
- what causes them, enable advancing the value for that table.
+ When <command>VACUUM</> scans all unfrozen pages, regardless of what causes
+ it to do so, the value for that table can be advanced.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. A
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
- nominally disabled.
+ Both of these kinds of scans will occur even if autovacuum is nominally
+ disabled.
</para>
</sect3>
</sect2>
@@ -743,8 +740,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 85459d0..0bcd52d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1423,6 +1423,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Estimated number of rows modified since this table was last analyzed</entry>
</row>
<row>
+ <entry><structfield>n_frozen_pages</></entry>
+ <entry><type>integer</></entry>
+ <entry>Number of frozen pages</entry>
+ </row>
+ <row>
<entry><structfield>last_vacuum</></entry>
<entry><type>timestamp with time zone</></entry>
<entry>Last time at which this table was manually vacuumed
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 60782da..152b99c 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,8 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped by all-frozen bit of visibility map */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -156,8 +158,9 @@ static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
ItemPointer itemptr);
static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
static int vac_cmp_itemptr(const void *left, const void *right);
-static bool heap_page_is_all_visible(Relation rel, Buffer buf,
- TransactionId *visibility_cutoff_xid);
+static void heap_page_visible_status(Relation rel, Buffer buf,
+ TransactionId *visibility_cutoff_xid,
+ bool *all_visible, bool *all_frozen);
/*
@@ -222,7 +225,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* We request a full scan if either the table's frozen Xid is now older
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit.
+ * mxid full-table scan limit. During full scan, we could skip some pages
+ * according to all-frozen bit of visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -254,7 +258,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -275,15 +280,15 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* Update statistics in pg_class.
*
* A corner case here is that if we scanned no pages at all because every
- * page is all-visible, we should not update relpages/reltuples, because
- * we have no new information to contribute. In particular this keeps us
- * from replacing relpages=reltuples=0 (which means "unknown tuple
+ * page is all-visible or all-frozen, we should not update relpages/reltuples,
+ * because we have no new information to contribute. In particular this keeps
+ * us from replacing relpages=reltuples=0 (which means "unknown tuple
* density") with nonzero relpages and reltuples=0 (which means "zero
* tuple density") unless there's some actual evidence for the latter.
*
- * We do update relallvisible even in the corner case, since if the table
- * is all-visible we'd definitely like to know that. But clamp the value
- * to be not more than what we're setting relpages to.
+ * We do update relallvisible and relallfrozen even in the corner case,
+ * since if the table is all-visible we'd definitely like to know that.
+ * But clamp the value to be not more than what we're setting relpages to.
*
* Also, don't change relfrozenxid/relminmxid if we skipped any pages,
* since then we don't know for certain that all tuples have a newer xmin.
@@ -359,10 +364,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -485,9 +491,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to all-visible bit of
+ * visibility map means that we might not be able to update relfrozenxid,
+ * so we only want to do it if we can skip a goodly number. On the other hand,
+ * we count both how many pages we skipped according to all-frozen bit of
+ * visibility map and how many pages we froze, so we can update relfrozenxid
+ * if the sum of two is as many as pages of table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -497,18 +506,18 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we might not actually skip any pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
- * Note: The value returned by visibilitymap_test could be slightly
+ * Note: The value returned by visibilitymap_get_status could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*
* We will scan the table's last page, at least to the extent of
* determining whether it has tuples or not, even if it should be skipped
@@ -541,9 +550,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples frozen by this vacuum */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on the page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
@@ -573,14 +586,29 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
skipping_all_visible_blocks = true;
else
skipping_all_visible_blocks = false;
+
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all && !FORCE_CHECK_PAGE())
- continue;
+ /*
+ * This block is at least all-visible according to visibility map.
+ * We check whether this block is all-frozen or not, so that we can
+ * skip vacuuming this page even if scan_all is true.
+ */
+ bool all_frozen = VM_ALL_FROZEN(onerel, blkno, &vmbuffer);
+
+ if (scan_all && all_frozen && !FORCE_CHECK_PAGE())
+ {
+ vacrelstats->vmskipped_frozen_pages++;
+ continue;
+ }
+ else if (!scan_all && skipping_all_visible_blocks && !FORCE_CHECK_PAGE())
+ continue;
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -747,7 +775,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
empty_pages++;
freespace = PageGetHeapFreeSpace(page);
- /* empty pages are always all-visible */
+ /* empty pages are always all-visible and all-frozen */
if (!PageIsAllVisible(page))
{
START_CRIT_SECTION();
@@ -770,9 +798,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
log_newpage_buffer(buf, true);
PageSetAllVisible(page);
+ PageSetAllFrozen(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
vmbuffer, InvalidTransactionId,
- VISIBILITYMAP_ALL_VISIBLE);
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
END_CRIT_SECTION();
}
@@ -796,13 +825,15 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
/*
* Note: If you change anything in the loop below, also look at
- * heap_page_is_all_visible to see if that needs to be changed.
+ * heap_page_visible_status to see if that needs to be changed.
*/
for (offnum = FirstOffsetNumber;
offnum <= maxoff;
@@ -950,8 +981,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
+ /* Check whether this tuple is already frozen or not */
+ if (HeapTupleHeaderXminFrozen(tuple.t_data))
+ nalready_frozen += 1;
+
/*
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
@@ -998,6 +1034,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute total number of frozen tuples in a page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -1020,27 +1059,47 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* If this page is all-visible, consider setting the all-visible and all-frozen bits */
+ if (all_visible)
{
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid,
- VISIBILITYMAP_ALL_VISIBLE);
+ uint8 flags = 0;
+
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* mark page all-frozen, if all tuples are frozen and not marked yet */
+ if ((ntotal_frozen == ntup_per_page) && !all_frozen_according_to_vm)
+ {
+ Assert(PageIsAllVisible(page));
+
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
+
}
/*
@@ -1053,7 +1112,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
&& VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
{
- elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+ /* If the all-frozen is set then all-visible must be set */
+ if (all_frozen_according_to_vm)
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page is not marked all-visible (and all-frozen) but visibility map bit(s) is set in relation \"%s\" page %u",
relname, blkno);
visibilitymap_clear(onerel, blkno, vmbuffer);
}
@@ -1061,19 +1125,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/*
* It's possible for the value returned by GetOldestXmin() to move
* backwards, so it's not wrong for us to see tuples that appear to
- * not be visible to everyone yet, while PD_ALL_VISIBLE is already
- * set. The real safe xmin value never moves backwards, but
- * GetOldestXmin() is conservative and sometimes returns a value
+ * not be visible to everyone yet, while PD_ALL_VISIBLE (and PD_ALL_FROZEN)
+ * are already set. The real safe xmin value never moves backwards,
+ * but GetOldestXmin() is conservative and sometimes returns a value
* that's unnecessarily small, so if we see that contradiction it just
* means that the tuples that we think are not visible to everyone yet
- * actually are, and the PD_ALL_VISIBLE flag is correct.
+ * actually are, and the PD_ALL_VISIBLE (and PD_ALL_FROZEN) flags are
+ * correct.
*
* There should never be dead tuples on a page with PD_ALL_VISIBLE
* set, however.
*/
else if (PageIsAllVisible(page) && has_dead_tuples)
{
- elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+ /* If all-frozen is set then all-visible must be set */
+ if (PageIsAllFrozen(page))
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page containing dead tuples is marked as all-visible (and all-frozen) in relation \"%s\" page %u",
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
@@ -1147,6 +1217,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to visibility map",
+ "skipped %d frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages,
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
@@ -1263,6 +1340,8 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
OffsetNumber unused[MaxOffsetNumber];
int uncnt = 0;
TransactionId visibility_cutoff_xid;
+ bool all_visible;
+ bool all_frozen;
START_CRIT_SECTION();
@@ -1314,19 +1393,35 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
* dirty, exclusively locked, and, if needed, a full page image has been
* emitted in the log_heap_clean() above.
*/
- if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid))
+ heap_page_visible_status(onerel, buffer, &visibility_cutoff_xid,
+ &all_visible, &all_frozen);
+ if (all_visible)
PageSetAllVisible(page);
/*
* All the changes to the heap page have been done. If the all-visible
- * flag is now set, also set the VM bit.
+ * flag is now set, also set the VM all-visible bit.
+ * Also, if this page is all-frozen, set the VM all-frozen bit and flag.
*/
- if (PageIsAllVisible(page) &&
- !VM_ALL_VISIBLE(onerel, blkno, vmbuffer))
+ if (PageIsAllVisible(page))
{
+ uint8 vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer);
+ uint8 flags = 0;
+
+ if (!(vm_status & VISIBILITYMAP_ALL_VISIBLE))
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+
+ /* Set the VM all-frozen bit to flag, if needed */
+ if (all_frozen && !(vm_status & VISIBILITYMAP_ALL_FROZEN))
+ {
+ PageSetAllFrozen(page);
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
+
+
Assert(BufferIsValid(*vmbuffer));
visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr, *vmbuffer,
- visibility_cutoff_xid, VISIBILITYMAP_ALL_VISIBLE);
+ visibility_cutoff_xid, flags);
}
return tupindex;
@@ -1848,18 +1943,21 @@ vac_cmp_itemptr(const void *left, const void *right)
/*
* Check if every tuple in the given page is visible to all current and future
* transactions. Also return the visibility_cutoff_xid which is the highest
- * xmin amongst the visible tuples.
+ * xmin amongst the visible tuples, and all_frozen which implies that all tuples
+ * of this page are frozen.
*/
-static bool
-heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid)
+static void
+heap_page_visible_status(Relation rel, Buffer buf, TransactionId *visibility_cutoff_xid,
+ bool *all_visible, bool *all_frozen)
{
Page page = BufferGetPage(buf);
BlockNumber blockno = BufferGetBlockNumber(buf);
OffsetNumber offnum,
maxoff;
- bool all_visible = true;
*visibility_cutoff_xid = InvalidTransactionId;
+ *all_visible = true;
+ *all_frozen = true;
/*
* This is a stripped down version of the line pointer scan in
@@ -1867,7 +1965,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
*/
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = FirstOffsetNumber;
- offnum <= maxoff && all_visible;
+ offnum <= maxoff && *all_visible;
offnum = OffsetNumberNext(offnum))
{
ItemId itemid;
@@ -1883,11 +1981,12 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/*
* Dead line pointers can have index pointers pointing to them. So
- * they can't be treated as visible
+ * they can't be treated as visible and frozen.
*/
if (ItemIdIsDead(itemid))
{
- all_visible = false;
+ *all_visible = false;
+ *all_frozen = false;
break;
}
@@ -1906,7 +2005,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
/* Check comments in lazy_scan_heap. */
if (!HeapTupleHeaderXminCommitted(tuple.t_data))
{
- all_visible = false;
+ *all_visible = false;
break;
}
@@ -1917,13 +2016,17 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
xmin = HeapTupleHeaderGetXmin(tuple.t_data);
if (!TransactionIdPrecedes(xmin, OldestXmin))
{
- all_visible = false;
+ *all_visible = false;
break;
}
/* Track newest xmin on page. */
if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
*visibility_cutoff_xid = xmin;
+
+ /* Check whether this tuple is already frozen or not */
+ if (!HeapTupleHeaderXminFrozen(tuple.t_data))
+ *all_frozen = false;
}
break;
@@ -1931,7 +2034,8 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
case HEAPTUPLE_RECENTLY_DEAD:
case HEAPTUPLE_INSERT_IN_PROGRESS:
case HEAPTUPLE_DELETE_IN_PROGRESS:
- all_visible = false;
+ *all_visible = false;
+ *all_frozen = false;
break;
default:
@@ -1940,5 +2044,6 @@ heap_page_is_all_visible(Relation rel, Buffer buf, TransactionId *visibility_cut
}
} /* scan along page */
- return all_visible;
+ if (!(*all_visible))
+ *all_frozen = false;
}
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index f5be70f..95ababf 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -127,6 +127,8 @@ SELECT count(*) FROM tenk2 WHERE unique1 = 1;
1
(1 row)
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
SELECT pg_sleep(1.0);
@@ -175,6 +177,14 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
t | t
(1 row)
+SELECT n_frozen_pages = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+ ?column?
+----------
+ t
+(1 row)
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
snapshot_newer
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index bec0316..2324420 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -110,3 +110,6 @@ test: event_trigger
# run stats by itself because its delay may be insufficient under heavy load
test: stats
+
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 7e9b319..df4c717 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -162,3 +162,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index cd2d592..dea5553 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -120,6 +120,8 @@ ROLLBACK;
SELECT count(*) FROM tenk2;
-- do an indexscan
SELECT count(*) FROM tenk2 WHERE unique1 = 1;
+-- do VACUUM FREEZE
+VACUUM FREEZE tenk2;
-- force the rate-limiting logic in pgstat_report_tabstat() to time out
-- and send a message
@@ -145,6 +147,10 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
WHERE st.relname='tenk2' AND cl.relname='tenk2';
+SELECT n_frozen_pages = (pg_relation_size('tenk2') / current_setting('block_size')::int)
+ FROM pg_stat_user_tables
+ WHERE relname ='tenk2';
+
SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
FROM prevstats AS pr;
004_enhance_visibilitymap_debug_messages_v36.patchbinary/octet-stream; name=004_enhance_visibilitymap_debug_messages_v36.patchDownload
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 217c694..27a10fc 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -175,7 +175,7 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_clear %s, block %d", RelationGetRelationName(rel), heapBlk);
#endif
if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
@@ -274,7 +274,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
uint8 *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_set %s, block %d, flags %u", RelationGetRelationName(rel), heapBlk, flags);
#endif
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
@@ -367,7 +367,7 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
char *map;
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_get_status %s %d", RelationGetRelationName(rel), heapBlk);
+ elog(DEBUG1, "vm_get_status %s, block %d", RelationGetRelationName(rel), heapBlk);
#endif
/* Reuse the old pinned buffer if possible */
@@ -469,7 +469,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks);
#ifdef TRACE_VISIBILITYMAP
- elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ elog(DEBUG1, "vm_truncate %s, block %d", RelationGetRelationName(rel), nheapblocks);
#endif
RelationOpenSmgr(rel);
On Thu, Feb 18, 2016 at 3:45 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached are the updated 5 patches.
I would like to explain these patches briefly again here to make
reviewing easier. We can divide these patches into 2 purposes.
1. Freeze map
000_ patch adds an additional frozen bit into the visibility map, but doesn't
include the logic for improving freezing performance.
001_ patch gets rid of page-conversion code from pg_upgrade. (This
patch isn't essentially related to this feature, but is required by
the 002_ patch.)
002_ patch adds an upgrading mechanism from 9.6- to 9.6+ and its regression test.
2. Improve freezing logic
003_ patch changes VACUUM to optimize scans based on the freeze map
(i.e., the 000_ patch), and adds its regression test.
004_ patch enhances debug messages in src/backend/access/heap/visibilitymap.c
Please review them.
I have pushed 000 and part of 003, with substantial revisions to the
003 part and minor revisions to the 000 part. This gets the basic
infrastructure in place, but the vacuum optimization and pg_upgrade
fixes still need to be done.
I discovered that make check-world failed with 000 applied, because
the Assert()s added to visibilitymap_set were using | rather than & to
test for a set bit. I fixed that.
I revised the code in vacuumlazy.c that updates the new map bits
rather heavily. I hope I didn't break anything; please have a look
and see if you spot any problems. One big problem was that it's
inadequate to judge whether a tuple needs freezing just by looking at
xmin; xmax might need to be cleared, for example.
I removed the pgstat stuff. I'm not sure we want that stuff in that
form; it doesn't seem to fit with the rest of what's in that view, and
it wasn't reliable in my testing. I did however throw together a
little contrib module for testing, which I attach here. I'm not sure
we want to commit this, and at the least someone would need to write
documentation. But it's certainly handy for checking whether this
works.
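As a rough usage sketch (function names as defined in the attached
pg_visibilitymap--1.0.sql; the bit values 1 = all-visible and
2 = all-frozen for mapbits are assumed here, not spelled out in this
thread):

CREATE EXTENSION pg_visibilitymap;

VACUUM FREEZE tenk2;

-- Per-block view: mapbits from the visibility map, pagebits from the
-- page header flags.
SELECT blkno, mapbits, pagebits
FROM pg_visibility('tenk2'::regclass)
ORDER BY blkno
LIMIT 5;

-- Blocks not yet marked all-frozen in the VM; right after VACUUM FREEZE
-- this count should normally be close to zero.
SELECT count(*)
FROM pg_visibilitymap('tenk2'::regclass)
WHERE mapbits & 2 = 0;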
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
pg_visibilitymap-v1.patchapplication/x-download; name=pg_visibilitymap-v1.patchDownload
From 03482265b539bfd46ce684935c4b6697217dfe21 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 1 Mar 2016 20:55:43 -0500
Subject: [PATCH 2/2] pg_visibilitymap
---
contrib/pg_visibilitymap/Makefile | 19 ++++++
contrib/pg_visibilitymap/pg_visibilitymap--1.0.sql | 46 ++++++++++++++
contrib/pg_visibilitymap/pg_visibilitymap.c | 70 ++++++++++++++++++++++
contrib/pg_visibilitymap/pg_visibilitymap.control | 5 ++
4 files changed, 140 insertions(+)
create mode 100644 contrib/pg_visibilitymap/Makefile
create mode 100644 contrib/pg_visibilitymap/pg_visibilitymap--1.0.sql
create mode 100644 contrib/pg_visibilitymap/pg_visibilitymap.c
create mode 100644 contrib/pg_visibilitymap/pg_visibilitymap.control
diff --git a/contrib/pg_visibilitymap/Makefile b/contrib/pg_visibilitymap/Makefile
new file mode 100644
index 0000000..76cf1fa
--- /dev/null
+++ b/contrib/pg_visibilitymap/Makefile
@@ -0,0 +1,19 @@
+# contrib/pg_visibilitymap/Makefile
+
+MODULE_big = pg_visibilitymap
+OBJS = pg_visibilitymap.o $(WIN32RES)
+
+EXTENSION = pg_visibilitymap
+DATA = pg_visibilitymap--1.0.sql
+PGFILEDESC = "pg_visibilitymap - monitoring of visibility map"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_visibilitymap
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_visibilitymap/pg_visibilitymap--1.0.sql b/contrib/pg_visibilitymap/pg_visibilitymap--1.0.sql
new file mode 100644
index 0000000..4817b2c
--- /dev/null
+++ b/contrib/pg_visibilitymap/pg_visibilitymap--1.0.sql
@@ -0,0 +1,46 @@
+/* contrib/pg_visibilitymap/pg_visibilitymap--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_visibilitymap" to load this file. \quit
+
+-- Show visibility map information.
+CREATE FUNCTION pg_visibilitymap(regclass, bigint)
+RETURNS int4
+AS 'MODULE_PATHNAME', 'pg_visibilitymap'
+LANGUAGE C STRICT;
+
+-- Show page status information.
+CREATE FUNCTION pg_page_flags(regclass, bigint)
+RETURNS int4
+AS 'MODULE_PATHNAME', 'pg_page_flags'
+LANGUAGE C STRICT;
+
+-- pg_visibilitymap shows the visibility map bits for each block in a relation
+CREATE FUNCTION
+ pg_visibilitymap(rel regclass, blkno OUT bigint, mapbits OUT int4)
+RETURNS SETOF RECORD
+AS $$
+ SELECT blkno, pg_visibilitymap($1, blkno) AS mapbits
+ FROM generate_series(0, pg_relation_size($1) / current_setting('block_size')::bigint - 1) AS blkno;
+$$
+LANGUAGE SQL;
+
+-- pg_visibility shows the visibility map bits and page-level bits for each
+-- block in a relation. this is more expensive than pg_visibilitymap since
+-- we must read all of the pages.
+CREATE FUNCTION
+ pg_visibility(rel regclass, blkno OUT bigint, mapbits OUT int4,
+ pagebits OUT int4)
+RETURNS SETOF RECORD
+AS $$
+ SELECT blkno, pg_visibilitymap($1, blkno) AS mapbits,
+ pg_page_flags($1, blkno) AS pagebits
+ FROM generate_series(0, pg_relation_size($1) / current_setting('block_size')::bigint - 1) AS blkno;
+$$
+LANGUAGE SQL;
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_visibilitymap(regclass, bigint) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_page_flags(regclass, bigint) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibilitymap(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility(regclass) FROM PUBLIC;
diff --git a/contrib/pg_visibilitymap/pg_visibilitymap.c b/contrib/pg_visibilitymap/pg_visibilitymap.c
new file mode 100644
index 0000000..9ffc7e1
--- /dev/null
+++ b/contrib/pg_visibilitymap/pg_visibilitymap.c
@@ -0,0 +1,70 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_visibilitymap.c
+ * display contents of a visibility map and page level bits
+ *
+ * contrib/pg_visibilitymap/pg_visibilitymap.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/visibilitymap.h"
+#include "funcapi.h"
+#include "storage/bufmgr.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(pg_visibilitymap);
+PG_FUNCTION_INFO_V1(pg_page_flags);
+
+Datum
+pg_visibilitymap(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 blkno = PG_GETARG_INT64(1);
+ int32 mapbits;
+ Relation rel;
+ Buffer vmbuffer = InvalidBuffer;
+
+ rel = relation_open(relid, AccessShareLock);
+
+ if (blkno < 0 || blkno > MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid block number")));
+
+ mapbits = (int32) visibilitymap_get_status(rel, blkno, &vmbuffer);
+ if (vmbuffer != InvalidBuffer)
+ ReleaseBuffer(vmbuffer);
+
+ relation_close(rel, AccessShareLock);
+ PG_RETURN_INT32(mapbits);
+}
+
+Datum
+pg_page_flags(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 blkno = PG_GETARG_INT64(1);
+ int32 pagebits;
+ Relation rel;
+ Buffer buffer;
+ Page page;
+
+ rel = relation_open(relid, AccessShareLock);
+
+ if (blkno < 0 || blkno > MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid block number")));
+
+ buffer = ReadBuffer(rel, blkno);
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+ page = BufferGetPage(buffer);
+ pagebits = (int32) (((PageHeader) (page))->pd_flags);
+
+ UnlockReleaseBuffer(buffer);
+ relation_close(rel, AccessShareLock);
+ PG_RETURN_INT32(pagebits);
+}
diff --git a/contrib/pg_visibilitymap/pg_visibilitymap.control b/contrib/pg_visibilitymap/pg_visibilitymap.control
new file mode 100644
index 0000000..f1686eb
--- /dev/null
+++ b/contrib/pg_visibilitymap/pg_visibilitymap.control
@@ -0,0 +1,5 @@
+# pg_visibilitymap extension
+comment = 'examine the visibility map (VM)'
+default_version = '1.0'
+module_pathname = '$libdir/pg_visibilitymap'
+relocatable = true
--
2.5.4 (Apple Git-61)
Thank you for revising and committing this.
At Tue, 1 Mar 2016 21:51:55 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZtG7hnkgP74zRCeuRrGGG917J5-_P4dzNJz5_kAXFTKg@mail.gmail.com>
On Thu, Feb 18, 2016 at 3:45 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached are the updated 5 patches.
I would like to explain these patches briefly again here to make
reviewing easier. We can divide these patches into 2 purposes.
1. Freeze map
000_ patch adds an additional frozen bit into the visibility map, but doesn't
include the logic for improving freezing performance.
001_ patch gets rid of page-conversion code from pg_upgrade. (This
patch isn't essentially related to this feature, but is required by
the 002_ patch.)
002_ patch adds an upgrading mechanism from 9.6- to 9.6+ and its regression test.
2. Improve freezing logic
003_ patch changes VACUUM to optimize scans based on the freeze map
(i.e., the 000_ patch), and adds its regression test.
004_ patch enhances debug messages in src/backend/access/heap/visibilitymap.c
Please review them.
I have pushed 000 and part of 003, with substantial revisions to the
003 part and minor revisions to the 000 part. This gets the basic
infrastructure in place, but the vacuum optimization and pg_upgrade
fixes still need to be done.
I discovered that make check-world failed with 000 applied, because
the Assert()s added to visibilitymap_set were using | rather than & to
test for a set bit. I fixed that.
It looks reasonable as far as I can see. Thank you for your
work on the additional part.
I revised the code in vacuumlazy.c that updates the new map bits
rather heavily. I hope I didn't break anything; please have a look
and see if you spot any problems. One big problem was that it's
inadequate to judge whether a tuple needs freezing just by looking at
xmin; xmax might need to be cleared, for example.
The new function heap_tuple_needs_eventual_freeze looks
reasonable to me in comparison with heap_tuple_needs_freeze.
Looking at the additional diff for lazy_vacuum_page, I noticed that
visibilitymap_set has a potential performance problem. (Though
it doesn't seem to occur for now.)
visibilitymap_set decides whether to modify the vm bits with the following
code.
| if (flags != (map[mapByte] >> mapBit & VISIBILITYMAP_VALID_BITS))
| {
| START_CRIT_SECTION();
|
| map[mapByte] |= (flags << mapBit);
This is effectively correct and not a problem, but it enters the critical
section for the case of (vmbit = 11, flags = 01), which does not
need it. Please apply this if it looks reasonable.
======
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 2e64fc3..87b7fc6 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -292,7 +292,8 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
map = (uint8 *)PageGetContents(page);
LockBuffer(vmBuf, BUFFER_LOCK_EXCLUSIVE);
- if (flags != (map[mapByte] >> mapBit & VISIBILITYMAP_VALID_BITS))
+ /* modify vm bits only if any bit is necessary to be set */
+ if (~flags & (map[mapByte] >> mapBit & VISIBILITYMAP_VALID_BITS))
{
START_CRIT_SECTION();
======
I removed the pgstat stuff. I'm not sure we want that stuff in that
form; it doesn't seem to fit with the rest of what's in that view, and
it wasn't reliable in my testing. I did however throw together a
little contrib module for testing, which I attach here. I'm not sure
we want to commit this, and at the least someone would need to write
documentation. But it's certainly handy for checking whether this
works.
I haven't considered the reliability aspect, but the
n_frozen_pages column in the proposed patch surely seems alien to the
columns around it.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Mar 1, 2016 at 6:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I removed the pgstat stuff. I'm not sure we want that stuff in that
form; it doesn't seem to fit with the rest of what's in that view, and
it wasn't reliable in my testing. I did however throw together a
little contrib module for testing, which I attach here. I'm not sure
we want to commit this, and at the least someone would need to write
documentation. But it's certainly handy for checking whether this
works.
I think you should commit this. The chances of anyone other than you
and Masahiko recalling that you developed this tool in 3 years is
essentially nil. I think that the cost of committing a developer-level
debugging tool like this is very low. Modules like pg_freespacemap
currently already have no chance of being of use to ordinary users.
All you need to do is restrict the functions to throw an error when
called by non-superusers, out of caution.
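A minimal SQL-level sketch of that kind of restriction, using the
function signatures from the module posted upthread (a hard superuser
check would have to live in the C functions themselves; the monitoring
role below is purely hypothetical):

REVOKE ALL ON FUNCTION pg_visibilitymap(regclass, bigint) FROM PUBLIC;
REVOKE ALL ON FUNCTION pg_page_flags(regclass, bigint) FROM PUBLIC;

-- Hypothetical: explicitly opt a trusted monitoring role back in.
GRANT EXECUTE ON FUNCTION pg_visibilitymap(regclass, bigint) TO monitoring_role;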
It's a problem that modules like pg_stat_statements and
pg_freespacemap are currently lumped together in the documentation,
but we all know that.
--
Peter Geoghegan
On 3/2/16 4:21 PM, Peter Geoghegan wrote:
I think you should commit this. The chances of anyone other than you
and Masahiko recalling that you developed this tool in 3 years is
essentially nil. I think that the cost of committing a developer-level
debugging tool like this is very low. Modules like pg_freespacemap
currently already have no chance of being of use to ordinary users.
All you need to do is restrict the functions to throw an error when
called by non-superusers, out of caution.
It's a problem that modules like pg_stat_statements and
pg_freespacemap are currently lumped together in the documentation,
but we all know that.
+1.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
On 3/2/16 4:21 PM, Peter Geoghegan wrote:
I think you should commit this. The chances of anyone other than you
and Masahiko recalling that you developed this tool in 3 years is
essentially nil. I think that the cost of committing a developer-level
debugging tool like this is very low. Modules like pg_freespacemap
currently already have no chance of being of use to ordinary users.
All you need to do is restrict the functions to throw an error when
called by non-superusers, out of caution.
It's a problem that modules like pg_stat_statements and
pg_freespacemap are currently lumped together in the documentation,
but we all know that.
+1.
Would it make any sense to stick it under src/test/modules/ instead of
contrib/ ? That would help make it clear that it's a debugging tool
and not something we expect end users to use.
regards, tom lane
On 3/2/16 5:41 PM, Tom Lane wrote:
Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
On 3/2/16 4:21 PM, Peter Geoghegan wrote:
I think you should commit this. The chances of anyone other than you
and Masahiko recalling that you developed this tool in 3 years is
essentially nil. I think that the cost of committing a developer-level
debugging tool like this is very low. Modules like pg_freespacemap
currently already have no chance of being of use to ordinary users.
All you need to do is restrict the functions to throw an error when
called by non-superusers, out of caution.
It's a problem that modules like pg_stat_statements and
pg_freespacemap are currently lumped together in the documentation,
but we all know that.
+1.
Would it make any sense to stick it under src/test/modules/ instead of
contrib/ ? That would help make it clear that it's a debugging tool
and not something we expect end users to use.
I haven't looked at it in detail; is there something inherently
dangerous about it?
When I'm forced to wear a DBA hat, I'd really love to be able to find
out what VM status for a large table is. If it's in contrib they'll know
the tool is there; if it's under src then there's about 0 chance of
that. I'd think SU-only and any appropriate warnings would be enough
heads-up for DBAs to be careful with it.
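A sketch of that kind of one-off DBA check, assuming the
pg_visibilitymap(regclass) function from the module posted upthread and
the assumed mapbits encoding (bit 1 = all-visible, bit 2 = all-frozen);
the table name is hypothetical:

SELECT count(*)                                AS total_blocks,
       count(*) FILTER (WHERE mapbits & 1 > 0) AS all_visible_blocks,
       count(*) FILTER (WHERE mapbits & 2 > 0) AS all_frozen_blocks
FROM pg_visibilitymap('some_big_table'::regclass);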
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
At Wed, 2 Mar 2016 17:57:27 -0600, Jim Nasby <Jim.Nasby@BlueTreble.com> wrote in <56D77DE7.7080309@BlueTreble.com>
On 3/2/16 5:41 PM, Tom Lane wrote:
Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
On 3/2/16 4:21 PM, Peter Geoghegan wrote:
I think you should commit this. The chances of anyone other than you
and Masahiko recalling that you developed this tool in 3 years is
essentially nil. I think that the cost of committing a developer-level
debugging tool like this is very low. Modules like pg_freespacemap
currently already have no chance of being of use to ordinary users.
All you need to do is restrict the functions to throw an error when
called by non-superusers, out of caution.
It's a problem that modules like pg_stat_statements and
pg_freespacemap are currently lumped together in the documentation,
but we all know that.
+1.
Would it make any sense to stick it under src/test/modules/ instead of
contrib/ ? That would help make it clear that it's a debugging tool
and not something we expect end users to use.
I haven't looked at it in detail; is there something inherently
dangerous about it?
I don't see any danger but the interface doesn't seem to fit use
by DBAs.
When I'm forced to wear a DBA hat, I'd really love to be able to find
out what VM status for a large table is. If it's in contrib they'll
know the tool is there; if it's under src then there's about 0 chance
of that. I'd think SU-only and any appropriate warnings would be
enough heads-up for DBAs to be careful with it.
It doesn't look like it exposes anything about table contents. At least,
anybody who can see the name of a table can safely be allowed to
use this on it.
A possible usage (for me) of this would be directly counting
(un)vacuumed or (un)frozen pages in a relation. It would be
convenient if the 'frozen' and 'vacuumed' bits were in separate
integers. It would also be useful if stats values for these bits were
shown in statistics views.
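A sketch of that counting with the module as posted, returning the two
counts as separate integers (same assumed mapbits encoding, bit 1 =
all-visible, bit 2 = all-frozen):

SELECT count(*) FILTER (WHERE mapbits & 1 = 0) AS not_all_visible_blocks,
       count(*) FILTER (WHERE mapbits & 2 = 0) AS unfrozen_blocks
FROM pg_visibilitymap('tenk2'::regclass);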
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Mar 2, 2016 at 6:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
On 3/2/16 4:21 PM, Peter Geoghegan wrote:
I think you should commit this. The chances of anyone other than you
and Masahiko recalling that you developed this tool in 3 years is
essentially nil. I think that the cost of committing a developer-level
debugging tool like this is very low. Modules like pg_freespacemap
currently already have no chance of being of use to ordinary users.
All you need to do is restrict the functions to throw an error when
called by non-superusers, out of caution.
It's a problem that modules like pg_stat_statements and
pg_freespacemap are currently lumped together in the documentation,
but we all know that.
+1.
Would it make any sense to stick it under src/test/modules/ instead of
contrib/ ? That would help make it clear that it's a debugging tool
and not something we expect end users to use.
I actually think end-users might well want to use it. Also, I created
it by hacking up pg_freespacemap, so it may make sense to have it in
the same place.
I would also be tempted to add additional C functions that scan the
entire visibility map and return counts of the total number of bits of
each type that are set, and similarly for the page level bits.
Presumably that would be much faster than
I am also tempted to change the API to be a bit more friendly,
although I am not sure exactly how. This was a quick and dirty hack
so that I could test, but the hardest thing about making it not a
quick and dirty hack is probably deciding on a good UI.
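For comparison, such whole-map counts can already be approximated in SQL
with the per-block function from the posted module, which is exactly the
per-block call overhead a dedicated C function would avoid (same assumed
mapbits encoding, bit 1 = all-visible, bit 2 = all-frozen):

SELECT mapbits, count(*) AS blocks
FROM pg_visibilitymap('tenk2'::regclass)
GROUP BY mapbits
ORDER BY mapbits;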
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Mar 5, 2016 at 1:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Mar 2, 2016 at 6:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
On 3/2/16 4:21 PM, Peter Geoghegan wrote:
I think you should commit this. The chances of anyone other than you
and Masahiko recalling that you developed this tool in 3 years is
essentially nil. I think that the cost of committing a developer-level
debugging tool like this is very low. Modules like pg_freespacemap
currently already have no chance of being of use to ordinary users.
All you need to do is restrict the functions to throw an error when
called by non-superusers, out of caution.
It's a problem that modules like pg_stat_statements and
pg_freespacemap are currently lumped together in the documentation,
but we all know that.
+1.
Would it make any sense to stick it under src/test/modules/ instead of
contrib/ ? That would help make it clear that it's a debugging tool
and not something we expect end users to use.
I actually think end-users might well want to use it. Also, I created
it by hacking up pg_freespacemap, so it may make sense to have it in
the same place.
I would also be tempted to add additional C functions that scan the
entire visibility map and return counts of the total number of bits of
each type that are set, and similarly for the page level bits.
Presumably that would be much faster than
+1.
I am also tempted to change the API to be a bit more friendly,
although I am not sure exactly how. This was a quick and dirty hack
so that I could test, but the hardest thing about making it not a
quick and dirty hack is probably deciding on a good UI.
Does it mean visibility map API in visibilitymap.c?
Regards,
--
Masahiko Sawada
On Sat, Mar 5, 2016 at 11:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Mar 5, 2016 at 1:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Mar 2, 2016 at 6:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
On 3/2/16 4:21 PM, Peter Geoghegan wrote:
I think you should commit this. The chances of anyone other than you
and Masahiko recalling that you developed this tool in 3 years is
essentially nil. I think that the cost of committing a developer-level
debugging tool like this is very low. Modules like pg_freespacemap
currently already have no chance of being of use to ordinary users.
All you need to do is restrict the functions to throw an error when
called by non-superusers, out of caution.
It's a problem that modules like pg_stat_statements and
pg_freespacemap are currently lumped together in the documentation,
but we all know that.
+1.
Would it make any sense to stick it under src/test/modules/ instead of
contrib/ ? That would help make it clear that it's a debugging tool
and not something we expect end users to use.
I actually think end-users might well want to use it. Also, I created
it by hacking up pg_freespacemap, so it may make sense to have it in
the same place.
I would also be tempted to add additional C functions that scan the
entire visibility map and return counts of the total number of bits of
each type that are set, and similarly for the page level bits.
Presumably that would be much faster than
+1.
I am also tempted to change the API to be a bit more friendly,
although I am not sure exactly how. This was a quick and dirty hack
so that I could test, but the hardest thing about making it not a
quick and dirty hack is probably deciding on a good UI.
Does it mean visibility map API in visibilitymap.c?
Attached is the latest version of the optimisation patch.
I'm still considering the pg_upgrade regression test code, so I
will submit that patch later.
Regards,
--
Masahiko Sawada
Attachments:
000_optimize_vacuum_using_freezemap_v37.patchapplication/octet-stream; name=000_optimize_vacuum_using_freezemap_v37.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a09ceb2..8a258f0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5984,7 +5984,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
the age specified by this setting. The default is 150 million
transactions. Although users can set this value anywhere from zero to
@@ -6028,7 +6028,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs eager freezing if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
the age specified by this setting. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..012e049 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -352,9 +352,9 @@
Vacuum maintains a <link linkend="storage-vm">visibility map</> for each
table to keep track of which pages contain only tuples that are known to be
visible to all active transactions (and all future transactions, until the
- page is again modified). This has two purposes. First, vacuum
- itself can skip such pages on the next run, since there is nothing to
- clean up.
+ page is again modified), and which pages contain only frozen tuples.
+ This has two purposes. First, vacuum itself can skip such pages
+ on the next run, since there is nothing to clean up.
</para>
<para>
@@ -438,28 +438,25 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
+ <command>VACUUM</> skips scanning pages that don't have any dead row
+ versions, and pages that have only frozen rows. To ensure all old
+ row versions have been frozen, a scan of all unfrozen pages is needed.
<xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> does that: whole-table freezing is forced if
+ the table hasn't had all row versions frozen for
+ <varname>vacuum_freeze_table_age</> minus <varname>vacuum_freeze_min_age</>
+ transactions.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
- unvacuumed for longer than
- that, data loss could result. To ensure that this does not happen,
- autovacuum is invoked on any table that might contain unfrozen rows with
- XIDs older than the age specified by the configuration parameter <xref
- linkend="guc-autovacuum-freeze-max-age">. (This will happen even if
- autovacuum is disabled.)
+ the time <command>VACUUM</> last scanned unfrozen pages. If it were to go
+ unvacuumed for longer than that, data loss could result. To ensure
+ that this does not happen, autovacuum is invoked on any table that might
+ contain unfrozen rows with XIDs older than the age specified by the
+ configuration parameter <xref linkend="guc-autovacuum-freeze-max-age">.
+ (This will happen even if autovacuum is disabled.)
</para>
<para>
@@ -490,8 +487,7 @@
a regularly scheduled <command>VACUUM</> or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
- was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ was recently vacuumed to reclaim space.
</para>
<para>
@@ -527,7 +523,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last whole-table freezing for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -555,17 +551,18 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
<structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
- require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ frozen. Whole-table freezing occurs once all pages of the relation
+ require it. In other cases, such as when <structfield>relfrozenxid</> is more
+ than <varname>vacuum_freeze_table_age</> transactions old or when
+ <command>VACUUM</>'s <literal>FREEZE</> option is used, <command>VACUUM</>
+ can skip pages on which all tuples are already marked as frozen.
+ When all pages of the table are eventually marked as frozen by <command>VACUUM</>,
+ after it's finished <literal>age(relfrozenxid)</> should be a little more
+ than the <varname>vacuum_freeze_min_age</> setting that was used (more by
+ the number of transactions started since the <command>VACUUM</> started).
+ If <structfield>relfrozenxid</> has not been advanced by the time
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -642,13 +639,13 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, a table
scan is forced. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
+ When <command>VACUUM</> scans all unfrozen pages, regardless of
what causes them, enable advancing the value for that table.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
@@ -656,13 +653,13 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
- whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ As a safety device, a vacuum scan will occur for any table whose
+ multixact-age is greater than
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. A
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
+ Both of these kinds of table scans will occur even if autovacuum is
nominally disabled.
</para>
</sect3>
@@ -743,8 +740,8 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
+ than <varname>vacuum_freeze_table_age</> transactions old, the table is
+ scanned to freeze old tuples and advance
<structfield>relfrozenxid</>, otherwise only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8f7b248..67a7396 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,7 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped by all-frozen bit */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -221,7 +222,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* We request a full scan if either the table's frozen Xid is now older
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit.
+ * mxid full-table scan limit. Even during a full scan, we can skip
+ * pages according to the all-frozen bit of the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +255,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -274,9 +277,9 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* Update statistics in pg_class.
*
* A corner case here is that if we scanned no pages at all because every
- * page is all-visible, we should not update relpages/reltuples, because
- * we have no new information to contribute. In particular this keeps us
- * from replacing relpages=reltuples=0 (which means "unknown tuple
+ * page is all-visible or all-frozen, we should not update relpages/reltuples,
+ * because we have no new information to contribute. In particular this keeps
+ * us from replacing relpages=reltuples=0 (which means "unknown tuple
* density") with nonzero relpages and reltuples=0 (which means "zero
* tuple density") unless there's some actual evidence for the latter.
*
@@ -354,10 +357,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -480,9 +484,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of
+ * the visibility map means that we might not be able to update relfrozenxid,
+ * so we only want to do it if we can skip a goodly number. On the other hand,
+ * we count both how many pages we skipped according to the all-frozen bit and
+ * how many pages we froze, so we can update relfrozenxid if the sum of the two
+ * equals the number of pages in the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -492,18 +499,18 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we might not actually skip any pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
* Note: The value returned by visibilitymap_get_status could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*
* We will scan the table's last page, at least to the extent of
* determining whether it has tuples or not, even if it should be skipped
@@ -536,9 +543,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples frozen by this vacuum */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on this page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool all_frozen = true; /* provided all_visible is also true */
bool has_dead_tuples;
@@ -570,13 +581,27 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
skipping_all_visible_blocks = false;
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all && !FORCE_CHECK_PAGE())
+ /*
+ * This block is at least all-visible according to the visibility map.
+ * We check whether this block is also all-frozen, so that we can skip
+ * scanning this page even if scan_all is true.
+ */
+ bool all_frozen = VM_ALL_FROZEN(onerel, blkno, &vmbuffer);
+
+ if (scan_all && all_frozen && !FORCE_CHECK_PAGE())
+ {
+ vacrelstats->vmskipped_frozen_pages++;
continue;
+ }
+ else if (skipping_all_visible_blocks && !scan_all && !FORCE_CHECK_PAGE())
+ continue;
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -793,6 +818,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -947,6 +974,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
/*
@@ -997,6 +1025,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute total number of frozen tuples in single page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -1019,33 +1050,45 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* If this page is all visible, consider to set all-visible and all-frozen */
+ if (all_visible)
{
- uint8 flags = VISIBILITYMAP_ALL_VISIBLE;
+ uint8 flags = 0;
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- if (all_frozen)
+ /* mark page all-visible, if appropriate */
+ if (all_visible && !all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
+
+ /* Mark page as all-frozen, if all tuples are frozen and not marked yet */
+ if ((all_frozen || (ntotal_frozen == ntup_per_page)) &&
+ !all_frozen_according_to_vm)
{
PageSetAllFrozen(page);
flags |= VISIBILITYMAP_ALL_FROZEN;
}
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid, flags);
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1058,7 +1101,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
&& VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
{
- elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+ /* If the all-frozen bit is set then all-visible bit must be set */
+ if (all_frozen_according_to_vm)
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page is not marked all-visible (and all-frozen) but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
visibilitymap_clear(onerel, blkno, vmbuffer);
}
@@ -1066,19 +1114,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/*
* It's possible for the value returned by GetOldestXmin() to move
* backwards, so it's not wrong for us to see tuples that appear to
- * not be visible to everyone yet, while PD_ALL_VISIBLE is already
- * set. The real safe xmin value never moves backwards, but
+ * not be visible to everyone yet, while PD_ALL_VISIBLE (and PD_ALL_FROZEN)
+ * are already set. The real safe xmin value never moves backwards, but
* GetOldestXmin() is conservative and sometimes returns a value
* that's unnecessarily small, so if we see that contradiction it just
* means that the tuples that we think are not visible to everyone yet
- * actually are, and the PD_ALL_VISIBLE flag is correct.
+ * actually are, and the PD_ALL_VISIBLE (and PD_ALL_FROZEN) flag are
+ * correct.
*
* There should never be dead tuples on a page with PD_ALL_VISIBLE
* set, however.
*/
else if (PageIsAllVisible(page) && has_dead_tuples)
{
- elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+ /* If the all-frozen bit is set then all-visible bit must be set */
+ if (all_frozen_according_to_vm)
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page containing dead tuples is marked as all-visible (and all-frozen) in relation \"%s\" page %u",
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
@@ -1174,6 +1228,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to the visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to visibility map",
+ "skipped %d frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages,
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..767a0ec
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,15 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+VACUUM FREEZE vmtest;
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 44 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 56 nonremovable row versions in 1 out of 45 pages
+\set VERBOSITY default
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index bec0316..9ad2ffc 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -110,3 +110,6 @@ test: event_trigger
# run stats by itself because its delay may be insufficient under heavy load
test: stats
+
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
\ No newline at end of file
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 7e9b319..4b4eb07 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -162,3 +162,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
\ No newline at end of file
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..fb9c811
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,13 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+VACUUM FREEZE vmtest;
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+\set VERBOSITY default
+
+DROP TABLE vmtest;
On Mon, Mar 7, 2016 at 12:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached is the latest version of the optimisation patch.
I'm still considering the pg_upgrade regression test code, so I
will submit that patch later.
I was thinking more about this today and I think that we don't
actually need the PD_ALL_FROZEN page-level bit for anything. It's
enough that the bit is present in the visibility map. The only point
of PD_ALL_VISIBLE is that it tells us that we need to clear the
visibility map bit, but that bit is enough to tell us to clear both
visibility map bits. So I propose the attached cleanup patch.
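To make the reasoning concrete, here is a minimal sketch (not taken from the
patch, and the variable names are illustrative) of the pattern the heap
modification paths already follow when they touch a page:

	/*
	 * If the page-level hint says the page was all-visible, any write to it
	 * must clear the visibility map entry.  visibilitymap_clear() wipes the
	 * whole VM entry for the block, i.e. both the all-visible and the
	 * all-frozen bit, so a separate PD_ALL_FROZEN page flag adds nothing.
	 */
	if (PageIsAllVisible(page))
	{
		PageClearAllVisible(page);
		visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer);
	}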
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
no-pd-all-frozen.patch (application/x-patch)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8a64321..34ba385 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7855,10 +7855,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
page = BufferGetPage(buffer);
- if (xlrec->flags & VISIBILITYMAP_ALL_VISIBLE)
- PageSetAllVisible(page);
- if (xlrec->flags & VISIBILITYMAP_ALL_FROZEN)
- PageSetAllFrozen(page);
+ PageSetAllVisible(page);
MarkBufferDirty(buffer);
}
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 2e64fc3..eaab4be 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -39,15 +39,15 @@
*
* When we *set* a visibility map during VACUUM, we must write WAL. This may
* seem counterintuitive, since the bit is basically a hint: if it is clear,
- * it may still be the case that every tuple on the page is all-visible or
- * all-frozen we just don't know that for certain. The difficulty is that
- * there are two bits which are typically set together: the PD_ALL_VISIBLE
- * or PD_ALL_FROZEN bit on the page itself, and the corresponding visibility
- * map bit. If a crash occurs after the visibility map page makes it to disk
- * and before the updated heap page makes it to disk, redo must set the bit on
- * the heap page. Otherwise, the next insert, update, or delete on the heap
- * page will fail to realize that the visibility map bit must be cleared,
- * possibly causing index-only scans to return wrong answers.
+ * it may still be the case that every tuple on the page is visible to all
+ * transactions; we just don't know that for certain. The difficulty is that
+ * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
+ * on the page itself, and the visibility map bit. If a crash occurs after the
+ * visibility map page makes it to disk and before the updated heap page makes
+ * it to disk, redo must set the bit on the heap page. Otherwise, the next
+ * insert, update, or delete on the heap page will fail to realize that the
+ * visibility map bit must be cleared, possibly causing index-only scans to
+ * return wrong answers.
*
* VACUUM will normally skip pages for which the visibility map bit is set;
* such pages can't contain any dead tuples and therefore don't need vacuuming.
@@ -251,11 +251,10 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
* to InvalidTransactionId when a page that is already all-visible is being
* marked all-frozen.
*
- * Caller is expected to set the heap page's PD_ALL_VISIBLE or PD_ALL_FROZEN
- * bit before calling this function. Except in recovery, caller should also
- * pass the heap buffer and flags which indicates what flag we want to set.
- * When checksums are enabled and we're not in recovery, we must add the heap
- * buffer to the WAL chain to protect it from being torn.
+ * Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
+ * this function. Except in recovery, caller should also pass the heap
+ * buffer. When checksums are enabled and we're not in recovery, we must add
+ * the heap buffer to the WAL chain to protect it from being torn.
*
* You must pass a buffer containing the correct map page to this function.
* Call visibilitymap_pin first to pin the right one. This function doesn't do
@@ -315,10 +314,8 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
{
Page heapPage = BufferGetPage(heapBuf);
- /* Caller is expected to set page-level bits first. */
- Assert((flags & VISIBILITYMAP_ALL_VISIBLE) == 0 || PageIsAllVisible(heapPage));
- Assert((flags & VISIBILITYMAP_ALL_FROZEN) == 0 || PageIsAllFrozen(heapPage));
-
+ /* caller is expected to set PD_ALL_VISIBLE first */
+ Assert(PageIsAllVisible(heapPage));
PageSetLSN(heapPage, recptr);
}
}
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8f7b248..363b2d0 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -766,7 +766,6 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
log_newpage_buffer(buf, true);
PageSetAllVisible(page);
- PageSetAllFrozen(page);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
vmbuffer, InvalidTransactionId,
VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
@@ -1024,6 +1023,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
{
uint8 flags = VISIBILITYMAP_ALL_VISIBLE;
+ if (all_frozen)
+ flags |= VISIBILITYMAP_ALL_FROZEN;
+
/*
* It should never be the case that the visibility map page is set
* while the page-level bit is clear, but the reverse is allowed
@@ -1038,11 +1040,6 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* rare cases after a crash, it is not worth optimizing.
*/
PageSetAllVisible(page);
- if (all_frozen)
- {
- PageSetAllFrozen(page);
- flags |= VISIBILITYMAP_ALL_FROZEN;
- }
MarkBufferDirty(buf);
visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
vmbuffer, visibility_cutoff_xid, flags);
@@ -1093,10 +1090,6 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else if (all_visible_according_to_vm && all_visible && all_frozen &&
!VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
{
- /* Page is marked all-visible but should be all-frozen */
- PageSetAllFrozen(page);
- MarkBufferDirty(buf);
-
/*
* We can pass InvalidTransactionId as the cutoff XID here,
* because setting the all-frozen bit doesn't cause recovery
@@ -1344,11 +1337,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
*/
if (heap_page_is_all_visible(onerel, buffer, &visibility_cutoff_xid,
&all_frozen))
- {
PageSetAllVisible(page);
- if (all_frozen)
- PageSetAllFrozen(page);
- }
/*
* All the changes to the heap page have been done. If the all-visible
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 0b023b3..d930166 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -178,8 +178,6 @@ typedef PageHeaderData *PageHeader;
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */
-#define PD_ALL_FROZEN 0x0008 /* all tuples on page are completely
- frozen */
#define PD_VALID_FLAG_BITS 0x000F /* OR of all valid pd_flags bits */
@@ -369,12 +367,7 @@ typedef PageHeaderData *PageHeader;
#define PageSetAllVisible(page) \
(((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
#define PageClearAllVisible(page) \
- (((PageHeader) (page))->pd_flags &= ~(PD_ALL_VISIBLE | PD_ALL_FROZEN))
-
-#define PageIsAllFrozen(page) \
- (((PageHeader) (page))->pd_flags & PD_ALL_FROZEN)
-#define PageSetAllFrozen(page) \
- (((PageHeader) (page))->pd_flags |= PD_ALL_FROZEN)
+ (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
#define PageIsPrunable(page, oldestxmin) \
( \
On Sat, Mar 5, 2016 at 9:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I actually think end-users might well want to use it. Also, I created
it by hacking up pg_freespacemap, so it may make sense to have it in
the same place.
I would also be tempted to add additional C functions that scan the
entire visibility map and return counts of the total number of bits of
each type that are set, and similarly for the page level bits.
Presumably that would be much faster than
+1.
I am also tempted to change the API to be a bit more friendly,
although I am not sure exactly how. This was a quick and dirty hack
so that I could test, but the hardest thing about making it not a
quick and dirty hack is probably deciding on a good UI.
Does it mean the visibility map API in visibilitymap.c?
Here's an updated patch with an API that I think is much more
reasonable to expose to users, and documentation! It assumes that the
patch I posted a few hours ago to remove PD_ALL_FROZEN will be
accepted; if that falls apart for some reason, I'll update this. I
plan to push this RSN if nobody objects.
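For anyone who wants to poke at it, usage looks roughly like this (going by
the signatures in the attached patch; the table name is just an example, and
the functions are superuser-only since they are revoked from PUBLIC):

CREATE EXTENSION pg_visibility;

-- visibility map bits for block 0 of a table
SELECT * FROM pg_visibility_map('pgbench_accounts'::regclass, 0);

-- VM bits plus the page-level PD_ALL_VISIBLE bit for every block
SELECT * FROM pg_visibility('pgbench_accounts'::regclass);

-- counts of all-visible and all-frozen pages according to the VM
SELECT * FROM pg_visibility_map_summary('pgbench_accounts'::regclass);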
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
pg_visibility-v2.patch (application/x-patch)
diff --git a/contrib/Makefile b/contrib/Makefile
index bd251f6..d12dd63 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -37,6 +37,7 @@ SUBDIRS = \
pgcrypto \
pgrowlocks \
pgstattuple \
+ pg_visibility \
postgres_fdw \
seg \
spi \
diff --git a/contrib/pg_visibility/Makefile b/contrib/pg_visibility/Makefile
new file mode 100644
index 0000000..fbbaa2e
--- /dev/null
+++ b/contrib/pg_visibility/Makefile
@@ -0,0 +1,19 @@
+# contrib/pg_visibility/Makefile
+
+MODULE_big = pg_visibility
+OBJS = pg_visibility.o $(WIN32RES)
+
+EXTENSION = pg_visibility
+DATA = pg_visibility--1.0.sql
+PGFILEDESC = "pg_visibility - page visibility information"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_visibility
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_visibility/pg_visibility--1.0.sql b/contrib/pg_visibility/pg_visibility--1.0.sql
new file mode 100644
index 0000000..9616e1f
--- /dev/null
+++ b/contrib/pg_visibility/pg_visibility--1.0.sql
@@ -0,0 +1,52 @@
+/* contrib/pg_visibility/pg_visibility--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_visibility" to load this file. \quit
+
+-- Show visibility map information.
+CREATE FUNCTION pg_visibility_map(regclass, blkno bigint,
+ all_visible OUT boolean,
+ all_frozen OUT boolean)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_visibility_map'
+LANGUAGE C STRICT;
+
+-- Show visibility map and page-level visibility information.
+CREATE FUNCTION pg_visibility(regclass, blkno, bigint,
+ all_visible OUT boolean,
+ all_frozen OUT boolean,
+ pd_all_visible OUT boolean)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_visibility'
+LANGUAGE C STRICT;
+
+-- Show visibility map information for each block in a relation.
+CREATE FUNCTION pg_visibility_map(regclass, blkno OUT bigint,
+ all_visible OUT boolean,
+ all_frozen OUT boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_visibility_map_rel'
+LANGUAGE C STRICT;
+
+-- Show visibility map and page-level visibility information for each block.
+CREATE FUNCTION pg_visibility(regclass, blkno OUT bigint,
+ all_visible OUT boolean,
+ all_frozen OUT boolean,
+ pd_all_visible OUT boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_visibility_rel'
+LANGUAGE C STRICT;
+
+-- Show summary of visibility map bits for a relation.
+CREATE FUNCTION pg_visibility_map_summary(regclass,
+ OUT all_visible bigint, OUT all_frozen bigint)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_visibility_map_summary'
+LANGUAGE C STRICT;
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_visibility_map(regclass, bigint) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility(regclass, bigint) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility_map(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_visibility_map_summary(regclass) FROM PUBLIC;
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
new file mode 100644
index 0000000..d4336ce
--- /dev/null
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -0,0 +1,346 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_visibility.c
+ * display visibility map information and page-level visibility bits
+ *
+ * contrib/pg_visibility/pg_visibility.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/visibilitymap.h"
+#include "catalog/pg_type.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+
+PG_MODULE_MAGIC;
+
+typedef struct vbits
+{
+ BlockNumber next;
+ BlockNumber count;
+ uint8 bits[FLEXIBLE_ARRAY_MEMBER];
+} vbits;
+
+PG_FUNCTION_INFO_V1(pg_visibility_map);
+PG_FUNCTION_INFO_V1(pg_visibility_map_rel);
+PG_FUNCTION_INFO_V1(pg_visibility);
+PG_FUNCTION_INFO_V1(pg_visibility_rel);
+PG_FUNCTION_INFO_V1(pg_visibility_map_summary);
+
+static TupleDesc pg_visibility_tupdesc(bool include_blkno, bool include_pd);
+static vbits *collect_visibility_data(Oid relid, bool include_pd);
+
+/*
+ * Visibility map information for a single block of a relation.
+ */
+Datum
+pg_visibility_map(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 blkno = PG_GETARG_INT64(1);
+ int32 mapbits;
+ Relation rel;
+ Buffer vmbuffer = InvalidBuffer;
+ TupleDesc tupdesc;
+ Datum values[2];
+ bool nulls[2];
+
+ rel = relation_open(relid, AccessShareLock);
+
+ if (blkno < 0 || blkno > MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid block number")));
+
+ tupdesc = pg_visibility_tupdesc(false, false);
+ MemSet(nulls, 0, sizeof(nulls));
+
+ mapbits = (int32) visibilitymap_get_status(rel, blkno, &vmbuffer);
+ if (vmbuffer != InvalidBuffer)
+ ReleaseBuffer(vmbuffer);
+ values[0] = BoolGetDatum((mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0);
+ values[1] = BoolGetDatum((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0);
+
+ relation_close(rel, AccessShareLock);
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/*
+ * Visibility map information for a single block of a relation, plus the
+ * page-level information for the same block.
+ */
+Datum
+pg_visibility(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ int64 blkno = PG_GETARG_INT64(1);
+ int32 mapbits;
+ Relation rel;
+ Buffer vmbuffer = InvalidBuffer;
+ Buffer buffer;
+ Page page;
+ TupleDesc tupdesc;
+ Datum values[3];
+ bool nulls[3];
+
+ rel = relation_open(relid, AccessShareLock);
+
+ if (blkno < 0 || blkno > MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid block number")));
+
+ tupdesc = pg_visibility_tupdesc(false, true);
+ MemSet(nulls, 0, sizeof(nulls));
+
+ mapbits = (int32) visibilitymap_get_status(rel, blkno, &vmbuffer);
+ if (vmbuffer != InvalidBuffer)
+ ReleaseBuffer(vmbuffer);
+ values[0] = BoolGetDatum((mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0);
+ values[1] = BoolGetDatum((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0);
+
+ buffer = ReadBuffer(rel, blkno);
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+ page = BufferGetPage(buffer);
+ values[2] = BoolGetDatum(PageIsAllVisible(page));
+
+ UnlockReleaseBuffer(buffer);
+
+ relation_close(rel, AccessShareLock);
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/*
+ * Visibility map information for every block in a relation.
+ */
+Datum
+pg_visibility_map_rel(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ vbits *info;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ Oid relid = PG_GETARG_OID(0);
+ MemoryContext oldcontext;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ funcctx->tuple_desc = pg_visibility_tupdesc(true, false);
+ funcctx->user_fctx = collect_visibility_data(relid, false);
+ MemoryContextSwitchTo(oldcontext);
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+ info = (vbits *) funcctx->user_fctx;
+
+ if (info->next < info->count)
+ {
+ Datum values[3];
+ bool nulls[3];
+ HeapTuple tuple;
+
+ MemSet(nulls, 0, sizeof(nulls));
+ values[0] = Int64GetDatum(info->next++);
+ values[1] = BoolGetDatum((info->bits[info->next] & (1 << 0)) != 0);
+ values[2] = BoolGetDatum((info->bits[info->next] & (1 << 1)) != 0);
+
+ tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+ SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
+ }
+
+ SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Visibility map information for every block in a relation, plus the page
+ * level information for each block.
+ */
+Datum
+pg_visibility_rel(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ vbits *info;
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ Oid relid = PG_GETARG_OID(0);
+ MemoryContext oldcontext;
+
+ funcctx = SRF_FIRSTCALL_INIT();
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ funcctx->tuple_desc = pg_visibility_tupdesc(true, true);
+ funcctx->user_fctx = collect_visibility_data(relid, true);
+ MemoryContextSwitchTo(oldcontext);
+ }
+
+ funcctx = SRF_PERCALL_SETUP();
+ info = (vbits *) funcctx->user_fctx;
+
+ if (info->next < info->count)
+ {
+ Datum values[4];
+ bool nulls[4];
+ HeapTuple tuple;
+
+ MemSet(nulls, 0, sizeof(nulls));
+ values[0] = Int64GetDatum(info->next++);
+ values[1] = BoolGetDatum((info->bits[info->next] & (1 << 0)) != 0);
+ values[2] = BoolGetDatum((info->bits[info->next] & (1 << 1)) != 0);
+ values[3] = BoolGetDatum((info->bits[info->next] & (1 << 2)) != 0);
+
+ tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+ SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
+ }
+
+ SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Count the number of all-visible and all-frozen pages in the visibility
+ * map for a particular relation.
+ */
+Datum
+pg_visibility_map_summary(PG_FUNCTION_ARGS)
+{
+ Oid relid = PG_GETARG_OID(0);
+ Relation rel;
+ BlockNumber nblocks;
+ BlockNumber blkno;
+ Buffer vmbuffer = InvalidBuffer;
+ int64 all_visible = 0;
+ int64 all_frozen = 0;
+ TupleDesc tupdesc;
+ Datum values[2];
+ bool nulls[2];
+
+ rel = relation_open(relid, AccessShareLock);
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ for (blkno = 0; blkno < nblocks; ++blkno)
+ {
+ int32 mapbits;
+
+ /* Make sure we are interruptible. */
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get map info. */
+ mapbits = (int32) visibilitymap_get_status(rel, blkno, &vmbuffer);
+ if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
+ ++all_visible;
+ if ((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
+ ++all_frozen;
+ }
+
+ /* Clean up. */
+ if (vmbuffer != InvalidBuffer)
+ ReleaseBuffer(vmbuffer);
+ relation_close(rel, AccessShareLock);
+
+ tupdesc = CreateTemplateTupleDesc(2, false);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "all_visible", INT8OID, -1, 0);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 2, "all_frozen", INT8OID, -1, 0);
+ tupdesc = BlessTupleDesc(tupdesc);
+
+ MemSet(nulls, 0, sizeof(nulls));
+ values[0] = Int64GetDatum(all_visible);
+ values[1] = Int64GetDatum(all_frozen);
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/*
+ * Helper function to construct whichever TupleDesc we need for a particular
+ * call.
+ */
+static TupleDesc
+pg_visibility_tupdesc(bool include_blkno, bool include_pd)
+{
+ TupleDesc tupdesc;
+ AttrNumber maxattr = 2;
+ AttrNumber a = 0;
+
+ if (include_blkno)
+ ++maxattr;
+ if (include_pd)
+ ++maxattr;
+ tupdesc = CreateTemplateTupleDesc(maxattr, false);
+ if (include_blkno)
+ TupleDescInitEntry(tupdesc, ++a, "blkno", INT8OID, -1, 0);
+ TupleDescInitEntry(tupdesc, ++a, "all_visible", BOOLOID, -1, 0);
+ TupleDescInitEntry(tupdesc, ++a, "all_frozen", BOOLOID, -1, 0);
+ if (include_pd)
+ TupleDescInitEntry(tupdesc, ++a, "pd_all_visible", BOOLOID, -1, 0);
+ Assert(a == maxattr);
+
+ return BlessTupleDesc(tupdesc);
+}
+
+/*
+ * Collect visibility data about a relation.
+ */
+static vbits *
+collect_visibility_data(Oid relid, bool include_pd)
+{
+ Relation rel;
+ BlockNumber nblocks;
+ vbits *info;
+ BlockNumber blkno;
+ Buffer vmbuffer = InvalidBuffer;
+ BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+ rel = relation_open(relid, AccessShareLock);
+ nblocks = RelationGetNumberOfBlocks(rel);
+ info = palloc0(offsetof(vbits, bits) + nblocks);
+ info->next = 0;
+ info->count = nblocks;
+
+ for (blkno = 0; blkno < nblocks; ++blkno)
+ {
+ int32 mapbits;
+
+ /* Make sure we are interruptible. */
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get map info. */
+ mapbits = (int32) visibilitymap_get_status(rel, blkno, &vmbuffer);
+ if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
+ info->bits[blkno] |= (1 << 0);
+ if ((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
+ info->bits[blkno] |= (1 << 1);
+
+ /*
+ * Page-level data requires reading every block, so only get it if
+ * the caller needs it. Use a buffer access strategy, too, to prevent
+ * cache-thrashing.
+ */
+ if (include_pd)
+ {
+ Buffer buffer;
+ Page page;
+
+ buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ bstrategy);
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+ page = BufferGetPage(buffer);
+ if (PageIsAllVisible(page))
+ info->bits[blkno] |= (1 << 2);
+
+ UnlockReleaseBuffer(buffer);
+ }
+ }
+
+ /* Clean up. */
+ if (vmbuffer != InvalidBuffer)
+ ReleaseBuffer(vmbuffer);
+ relation_close(rel, AccessShareLock);
+
+ return info;
+}
diff --git a/contrib/pg_visibility/pg_visibility.control b/contrib/pg_visibility/pg_visibility.control
new file mode 100644
index 0000000..1d71853
--- /dev/null
+++ b/contrib/pg_visibility/pg_visibility.control
@@ -0,0 +1,5 @@
+# pg_visibility extension
+comment = 'examine the visibility map (VM) and page-level visibility info'
+default_version = '1.0'
+module_pathname = '$libdir/pg_visibility'
+relocatable = true
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index 1b3d2d9..4e3f337 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -132,6 +132,7 @@ CREATE EXTENSION <replaceable>module_name</> FROM unpackaged;
&pgstatstatements;
&pgstattuple;
&pgtrgm;
+ &pgvisibility;
&postgres-fdw;
&seg;
&sepgsql;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index a12fee7..30adece 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -136,6 +136,7 @@
<!ENTITY pgstatstatements SYSTEM "pgstatstatements.sgml">
<!ENTITY pgstattuple SYSTEM "pgstattuple.sgml">
<!ENTITY pgtrgm SYSTEM "pgtrgm.sgml">
+<!ENTITY pgvisibility SYSTEM "pgvisibility.sgml">
<!ENTITY postgres-fdw SYSTEM "postgres-fdw.sgml">
<!ENTITY seg SYSTEM "seg.sgml">
<!ENTITY contrib-spi SYSTEM "contrib-spi.sgml">
diff --git a/doc/src/sgml/pgvisibility.sgml b/doc/src/sgml/pgvisibility.sgml
new file mode 100644
index 0000000..8795dcd
--- /dev/null
+++ b/doc/src/sgml/pgvisibility.sgml
@@ -0,0 +1,110 @@
+<!-- doc/src/sgml/pgvisibility.sgml -->
+
+<sect1 id="pgvisibility" xreflabel="pg_visibility">
+ <title>pg_visibility</title>
+
+ <indexterm zone="pgvisibility">
+ <primary>pg_visibility</primary>
+ </indexterm>
+
+ <para>
+ The <filename>pg_visibility</> module provides a means for examining the
+ visibility map (VM) and page-level visibility information.
+ </para>
+
+ <para>
+ These routines return information about three different bits. The
+ all-visible bit in the visibility map indicates that every tuple on
+ a given page of a relation is visible to every current transaction. The
+ all-frozen bit in the visibility map indicates that every tuple on the
+ page is frozen; that is, no future vacuum will need to modify the page
+ until such time as a tuple is inserted, updated, deleted, or locked on
+ that page. The page-level <literal>PD_ALL_VISIBLE</literal> bit has the
+ same meaning as the all-visible bit in the visibility map, but is stored
+ within the data page itself rather than in a separate data structure. These
+ will normally agree, but the page-level bit can sometimes be set while the
+ visibility map bit is clear after a crash recovery; or they can disagree
+ because of a change which occurs after <literal>pg_visibility</> examines
+ the visibility map and before it examines the data page.
+ </para>
+
+ <para>
+ Functions which display information about <literal>PD_ALL_VISIBLE</>
+ are much more costly than those which only consult the visibility map,
+ because they must read the relation's data blocks rather than only the
+ (much smaller) visibility map.
+ </para>
+
+ <sect2>
+ <title>Functions</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><function>pg_visibility_map(regclass, blkno bigint, all_visible OUT boolean, all_frozen OUT boolean) returns record</function></term>
+ <listitem>
+ <para>
+ Returns the all-visible and all-frozen bits in the visibility map for
+ the given block of the given relation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><function>pg_visibility(regclass, blkno bigint, all_visible OUT boolean, all_frozen OUT boolean, pd_all_visible OUT boolean) returns record</function></term>
+ <listitem>
+ <para>
+ Returns the all-visible and all-frozen bits in the visibility map for
+ the given block of the given relation, plus the
+ <literal>PD_ALL_VISIBLE</> bit for that block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><function>pg_visibility_map(regclass, blkno OUT bigint, all_visible OUT boolean, all_frozen OUT boolean) returns record</function></term>
+ <listitem>
+ <para>
+ Returns the all-visible and all-frozen bits in the visibility map for
+ each block of the given relation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><function>pg_visibility(regclass, blkno OUT bigint, all_visible OUT boolean, all_frozen OUT boolean, pd_all_visible OUT boolean) returns record</function></term>
+
+ <listitem>
+ <para>
+ Returns the all-visible and all-frozen bits in the visibility map for
+ each block of the given relation, plus the <literal>PD_ALL_VISIBLE</>
+ bit for each block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><function>pg_visibility_map_summary(regclass, all_visible OUT bigint, all_frozen OUT bigint) returns record</function></term>
+
+ <listitem>
+ <para>
+ Returns the number of all-visible pages and the number of all-frozen
+ pages in the relation according to the visibility map.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <para>
+ By default, these functions are not publicly executable.
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>Author</title>
+
+ <para>
+ Robert Haas <email>rhaas@postgresql.org</email>
+ </para>
+ </sect2>
+
+</sect1>
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index e2be43e..9b2e09e 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -648,6 +648,11 @@ might not be true. Visibility map bits are only set by vacuum, but are
cleared by any data-modifying operations on a page.
</para>
+<para>
+The <xref linkend="pgvisibility"> module can be used to examine the
+information stored in the visibility map.
+</para>
+
</sect1>
<sect1 id="storage-init">
On Mon, Mar 7, 2016 at 4:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Here's an updated patch with an API that I think is much more
reasonable to expose to users, and documentation! It assumes that the
patch I posted a few hours ago to remove PD_ALL_FROZEN will be
accepted; if that falls apart for some reason, I'll update this. I
plan to push this RSN if nobody objects.
Thanks for making the effort to make the tool generally available.
--
Peter Geoghegan
Hello, thank you for updating this tool.
At Mon, 7 Mar 2016 14:03:08 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+Tgmob+NjfYE3b3BHBmAC=3tvTbqsZgZ1RoJ63yRAmRgrQOcA@mail.gmail.com>
On Mon, Mar 7, 2016 at 12:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached is the latest version of the optimisation patch.
I'm still considering the pg_upgrade regression test code, so I
will submit that patch later.
I was thinking more about this today and I think that we don't
actually need the PD_ALL_FROZEN page-level bit for anything. It's
enough that the bit is present in the visibility map. The only point
of PD_ALL_VISIBLE is that it tells us that we need to clear the
visibility map bit, but that bit is enough to tell us to clear both
visibility map bits. So I propose the attached cleanup patch.
It seems reasonable to me. I haven't played with it yet (it didn't
even apply for me just now), but at a glance,
PD_VALID_FLAG_BITS seems like it should be changed to 0x0007 since
PD_ALL_FROZEN has been removed.
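For clarity, a minimal sketch of what that would look like in
src/include/storage/bufpage.h (assuming the other pd_flags bits stay as they
are today):

#define PD_HAS_FREE_LINES	0x0001	/* are there any unused line pointers? */
#define PD_PAGE_FULL		0x0002	/* not enough free space for a new tuple? */
#define PD_ALL_VISIBLE		0x0004	/* all tuples on page are visible to everyone */

#define PD_VALID_FLAG_BITS	0x0007	/* OR of all valid pd_flags bits, now that
									 * PD_ALL_FROZEN (0x0008) is gone */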
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Mar 8, 2016 at 1:20 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello, thank you for updating this tool.
At Mon, 7 Mar 2016 14:03:08 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+Tgmob+NjfYE3b3BHBmAC=3tvTbqsZgZ1RoJ63yRAmRgrQOcA@mail.gmail.com>
On Mon, Mar 7, 2016 at 12:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached is the latest version of the optimisation patch.
I'm still considering the pg_upgrade regression test code, so I
will submit that patch later.
I was thinking more about this today and I think that we don't
actually need the PD_ALL_FROZEN page-level bit for anything. It's
enough that the bit is present in the visibility map. The only point
of PD_ALL_VISIBLE is that it tells us that we need to clear the
visibility map bit, but that bit is enough to tell us to clear both
visibility map bits. So I propose the attached cleanup patch.
Thank you for updating the tool and proposing it.
I agree with you, and the patch you attached looks good to me except
for Horiguchi-san's comment.
Regarding the pg_visibility module, I'd like to share some bug fixes
and propose adding a relation-type check to each function.
Including that, I've attached the remaining 2 patches; one removes the
page conversion code from pg_upgrade, and the other adds pg_upgrade
support for the frozen bit.
Please have a look at them.
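As a hypothetical illustration of the relation-type check (the index name is
borrowed from the earlier vmtest regression test, not from these patches),
calling one of the functions on an index would now raise a clean error
instead of inspecting a relation that has no visibility map:

SELECT * FROM pg_visibility_map('vmtest_pkey'::regclass, 0);
ERROR:  "vmtest_pkey" is not a table or materialized view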
Regards,
--
Masahiko Sawada
Attachments:
Add_condition_to_pg_visibility.patch (application/x-patch)
diff --git a/contrib/pg_visibility/pg_visibility--1.0.sql b/contrib/pg_visibility/pg_visibility--1.0.sql
index 9616e1f..da511e5 100644
--- a/contrib/pg_visibility/pg_visibility--1.0.sql
+++ b/contrib/pg_visibility/pg_visibility--1.0.sql
@@ -12,7 +12,7 @@ AS 'MODULE_PATHNAME', 'pg_visibility_map'
LANGUAGE C STRICT;
-- Show visibility map and page-level visibility information.
-CREATE FUNCTION pg_visibility(regclass, blkno, bigint,
+CREATE FUNCTION pg_visibility(regclass, blkno bigint,
all_visible OUT boolean,
all_frozen OUT boolean,
pd_all_visible OUT boolean)
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index d4336ce..2993bcb 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -14,6 +14,7 @@
#include "funcapi.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "utils/rel.h"
PG_MODULE_MAGIC;
@@ -55,6 +56,14 @@ pg_visibility_map(PG_FUNCTION_ARGS)
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("invalid block number")));
+ /* Check for relation type */
+ if (!(rel->rd_rel->relkind == RELKIND_RELATION ||
+ rel->rd_rel->relkind == RELKIND_MATVIEW))
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table or materialized view",
+ RelationGetRelationName(rel))));
+
tupdesc = pg_visibility_tupdesc(false, false);
MemSet(nulls, 0, sizeof(nulls));
@@ -94,6 +103,14 @@ pg_visibility(PG_FUNCTION_ARGS)
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("invalid block number")));
+ /* Check for relation type */
+ if (!(rel->rd_rel->relkind == RELKIND_RELATION ||
+ rel->rd_rel->relkind == RELKIND_MATVIEW))
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table or materialized view",
+ RelationGetRelationName(rel))));
+
tupdesc = pg_visibility_tupdesc(false, true);
MemSet(nulls, 0, sizeof(nulls));
@@ -147,9 +164,10 @@ pg_visibility_map_rel(PG_FUNCTION_ARGS)
HeapTuple tuple;
MemSet(nulls, 0, sizeof(nulls));
- values[0] = Int64GetDatum(info->next++);
+ values[0] = Int64GetDatum(info->next);
values[1] = BoolGetDatum((info->bits[info->next] & (1 << 0)) != 0);
values[2] = BoolGetDatum((info->bits[info->next] & (1 << 1)) != 0);
+ info->next++;
tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
@@ -190,10 +208,11 @@ pg_visibility_rel(PG_FUNCTION_ARGS)
HeapTuple tuple;
MemSet(nulls, 0, sizeof(nulls));
- values[0] = Int64GetDatum(info->next++);
+ values[0] = Int64GetDatum(info->next);
values[1] = BoolGetDatum((info->bits[info->next] & (1 << 0)) != 0);
values[2] = BoolGetDatum((info->bits[info->next] & (1 << 1)) != 0);
values[3] = BoolGetDatum((info->bits[info->next] & (1 << 2)) != 0);
+ info->next++;
tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
@@ -223,6 +242,14 @@ pg_visibility_map_summary(PG_FUNCTION_ARGS)
rel = relation_open(relid, AccessShareLock);
nblocks = RelationGetNumberOfBlocks(rel);
+ /* Check for relation type */
+ if (!(rel->rd_rel->relkind == RELKIND_RELATION ||
+ rel->rd_rel->relkind == RELKIND_MATVIEW))
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table or materialized view",
+ RelationGetRelationName(rel))));
+
for (blkno = 0; blkno < nblocks; ++blkno)
{
int32 mapbits;
@@ -296,6 +323,15 @@ collect_visibility_data(Oid relid, bool include_pd)
BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
rel = relation_open(relid, AccessShareLock);
+
+ /* Check for relation type */
+ if (!(rel->rd_rel->relkind == RELKIND_RELATION ||
+ rel->rd_rel->relkind == RELKIND_MATVIEW))
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table or materialized view",
+ RelationGetRelationName(rel))));
+
nblocks = RelationGetNumberOfBlocks(rel);
info = palloc0(offsetof(vbits, bits) + nblocks);
info->next = 0;
001_remove_page_conversion_code_v37.patchapplication/x-patch; name=001_remove_page_conversion_code_v37.patchDownload
commit 8ab96722459fc929d5b2d447ffda18fe1107abc0
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed Mar 2 11:09:41 2016 -0400
Initial.
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index d9c8145..0c882d9 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -8,7 +8,7 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
OBJS = check.o controldata.o dump.o exec.o file.o function.o info.o \
- option.o page.o parallel.o pg_upgrade.o relfilenode.o server.o \
+ option.o parallel.o pg_upgrade.o relfilenode.o server.o \
tablespace.o util.o version.o $(WIN32RES)
override CPPFLAGS := -DDLSUFFIX=\"$(DLSUFFIX)\" -I$(srcdir) -I$(libpq_srcdir) $(CPPFLAGS)
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index e0cb675..f932094 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -80,8 +80,6 @@ check_and_dump_old_cluster(bool live_check)
if (!live_check)
start_postmaster(&old_cluster, true);
- get_pg_database_relfilenode(&old_cluster);
-
/* Extract a list of databases and tables from the old cluster */
get_db_and_rel_infos(&old_cluster);
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 9357ad8..115d506 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -25,15 +25,11 @@ static int win32_pghardlink(const char *src, const char *dst);
/*
* copyAndUpdateFile()
*
- * Copies a relation file from src to dst. If pageConverter is non-NULL, this function
- * uses that pageConverter to do a page-by-page conversion.
+ * Copies a relation file from src to dst.
*/
const char *
-copyAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst, bool force)
+copyAndUpdateFile(const char *src, const char *dst, bool force)
{
- if (pageConverter == NULL)
- {
#ifndef WIN32
if (copy_file(src, dst, force) == -1)
#else
@@ -42,65 +38,6 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
return getErrorText();
else
return NULL;
- }
- else
- {
- /*
- * We have a pageConverter object - that implies that the
- * PageLayoutVersion differs between the two clusters so we have to
- * perform a page-by-page conversion.
- *
- * If the pageConverter can convert the entire file at once, invoke
- * that plugin function, otherwise, read each page in the relation
- * file and call the convertPage plugin function.
- */
-
-#ifdef PAGE_CONVERSION
- if (pageConverter->convertFile)
- return pageConverter->convertFile(pageConverter->pluginData,
- dst, src);
- else
-#endif
- {
- int src_fd;
- int dstfd;
- char buf[BLCKSZ];
- ssize_t bytesRead;
- const char *msg = NULL;
-
- if ((src_fd = open(src, O_RDONLY, 0)) < 0)
- return "could not open source file";
-
- if ((dstfd = open(dst, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR)) < 0)
- {
- close(src_fd);
- return "could not create destination file";
- }
-
- while ((bytesRead = read(src_fd, buf, BLCKSZ)) == BLCKSZ)
- {
-#ifdef PAGE_CONVERSION
- if ((msg = pageConverter->convertPage(pageConverter->pluginData, buf, buf)) != NULL)
- break;
-#endif
- if (write(dstfd, buf, BLCKSZ) != BLCKSZ)
- {
- msg = "could not write new page to destination";
- break;
- }
- }
-
- close(src_fd);
- close(dstfd);
-
- if (msg)
- return msg;
- else if (bytesRead != 0)
- return "found partial page in source file";
- else
- return NULL;
- }
- }
}
@@ -114,12 +51,8 @@ copyAndUpdateFile(pageCnvCtx *pageConverter,
* instead of copying the data from the old cluster to the new cluster.
*/
const char *
-linkAndUpdateFile(pageCnvCtx *pageConverter,
- const char *src, const char *dst)
+linkAndUpdateFile(const char *src, const char *dst)
{
- if (pageConverter != NULL)
- return "Cannot in-place update this cluster, page-by-page conversion is required";
-
if (pg_link_file(src, dst) == -1)
return getErrorText();
else
diff --git a/src/bin/pg_upgrade/page.c b/src/bin/pg_upgrade/page.c
deleted file mode 100644
index e5686e5..0000000
--- a/src/bin/pg_upgrade/page.c
+++ /dev/null
@@ -1,164 +0,0 @@
-/*
- * page.c
- *
- * per-page conversion operations
- *
- * Copyright (c) 2010-2016, PostgreSQL Global Development Group
- * src/bin/pg_upgrade/page.c
- */
-
-#include "postgres_fe.h"
-
-#include "pg_upgrade.h"
-
-#include "storage/bufpage.h"
-
-
-#ifdef PAGE_CONVERSION
-
-
-static void getPageVersion(
- uint16 *version, const char *pathName);
-static pageCnvCtx *loadConverterPlugin(
- uint16 newPageVersion, uint16 oldPageVersion);
-
-
-/*
- * setupPageConverter()
- *
- * This function determines the PageLayoutVersion of the old cluster and
- * the PageLayoutVersion of the new cluster. If the versions differ, this
- * function loads a converter plugin and returns a pointer to a pageCnvCtx
- * object (in *result) that knows how to convert pages from the old format
- * to the new format. If the versions are identical, this function just
- * returns a NULL pageCnvCtx pointer to indicate that page-by-page conversion
- * is not required.
- */
-pageCnvCtx *
-setupPageConverter(void)
-{
- uint16 oldPageVersion;
- uint16 newPageVersion;
- pageCnvCtx *converter;
- const char *msg;
- char dstName[MAXPGPATH];
- char srcName[MAXPGPATH];
-
- snprintf(dstName, sizeof(dstName), "%s/global/%u", new_cluster.pgdata,
- new_cluster.pg_database_oid);
- snprintf(srcName, sizeof(srcName), "%s/global/%u", old_cluster.pgdata,
- old_cluster.pg_database_oid);
-
- getPageVersion(&oldPageVersion, srcName);
- getPageVersion(&newPageVersion, dstName);
-
- /*
- * If the old cluster and new cluster use the same page layouts, then we
- * don't need a page converter.
- */
- if (newPageVersion != oldPageVersion)
- {
- /*
- * The clusters use differing page layouts, see if we can find a
- * plugin that knows how to convert from the old page layout to the
- * new page layout.
- */
-
- if ((converter = loadConverterPlugin(newPageVersion, oldPageVersion)) == NULL)
- pg_fatal("could not find plugin to convert from old page layout to new page layout\n");
-
- return converter;
- }
- else
- return NULL;
-}
-
-
-/*
- * getPageVersion()
- *
- * Retrieves the PageLayoutVersion for the given relation.
- *
- * Returns NULL on success (and stores the PageLayoutVersion at *version),
- * if an error occurs, this function returns an error message (in the form
- * of a null-terminated string).
- */
-static void
-getPageVersion(uint16 *version, const char *pathName)
-{
- int relfd;
- PageHeaderData page;
- ssize_t bytesRead;
-
- if ((relfd = open(pathName, O_RDONLY, 0)) < 0)
- pg_fatal("could not open relation %s\n", pathName);
-
- if ((bytesRead = read(relfd, &page, sizeof(page))) != sizeof(page))
- pg_fatal("could not read page header of %s\n", pathName);
-
- *version = PageGetPageLayoutVersion(&page);
-
- close(relfd);
-
- return;
-}
-
-
-/*
- * loadConverterPlugin()
- *
- * This function loads a page-converter plugin library and grabs a
- * pointer to each of the (interesting) functions provided by that
- * plugin. The name of the plugin library is derived from the given
- * newPageVersion and oldPageVersion. If a plugin is found, this
- * function returns a pointer to a pageCnvCtx object (which will contain
- * a collection of plugin function pointers). If the required plugin
- * is not found, this function returns NULL.
- */
-static pageCnvCtx *
-loadConverterPlugin(uint16 newPageVersion, uint16 oldPageVersion)
-{
- char pluginName[MAXPGPATH];
- void *plugin;
-
- /*
- * Try to find a plugin that can convert pages of oldPageVersion into
- * pages of newPageVersion. For example, if we oldPageVersion = 3 and
- * newPageVersion is 4, we search for a plugin named:
- * plugins/convertLayout_3_to_4.dll
- */
-
- /*
- * FIXME: we are searching for plugins relative to the current directory,
- * we should really search relative to our own executable instead.
- */
- snprintf(pluginName, sizeof(pluginName), "./plugins/convertLayout_%d_to_%d%s",
- oldPageVersion, newPageVersion, DLSUFFIX);
-
- if ((plugin = pg_dlopen(pluginName)) == NULL)
- return NULL;
- else
- {
- pageCnvCtx *result = (pageCnvCtx *) pg_malloc(sizeof(*result));
-
- result->old.PageVersion = oldPageVersion;
- result->new.PageVersion = newPageVersion;
-
- result->startup = (pluginStartup) pg_dlsym(plugin, "init");
- result->convertFile = (pluginConvertFile) pg_dlsym(plugin, "convertFile");
- result->convertPage = (pluginConvertPage) pg_dlsym(plugin, "convertPage");
- result->shutdown = (pluginShutdown) pg_dlsym(plugin, "fini");
- result->pluginData = NULL;
-
- /*
- * If the plugin has exported an initializer, go ahead and invoke it.
- */
- if (result->startup)
- result->startup(MIGRATOR_API_VERSION, &result->pluginVersion,
- newPageVersion, oldPageVersion, &result->pluginData);
-
- return result;
- }
-}
-
-#endif
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 984c395..4f5361a 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -260,8 +260,6 @@ prepare_new_cluster(void)
new_cluster.bindir, cluster_conn_opts(&new_cluster),
log_opts.verbose ? "--verbose" : "");
check_ok();
-
- get_pg_database_relfilenode(&new_cluster);
}
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index bc733c4..900b2a7 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -269,7 +269,6 @@ typedef struct
uint32 major_version; /* PG_VERSION of cluster */
char major_version_str[64]; /* string PG_VERSION of cluster */
uint32 bin_version; /* version returned from pg_ctl */
- Oid pg_database_oid; /* OID of pg_database relation */
const char *tablespace_suffix; /* directory specification */
} ClusterInfo;
@@ -364,40 +363,8 @@ bool pid_lock_file_exists(const char *datadir);
/* file.c */
-#ifdef PAGE_CONVERSION
-typedef const char *(*pluginStartup) (uint16 migratorVersion,
- uint16 *pluginVersion, uint16 newPageVersion,
- uint16 oldPageVersion, void **pluginData);
-typedef const char *(*pluginConvertFile) (void *pluginData,
- const char *dstName, const char *srcName);
-typedef const char *(*pluginConvertPage) (void *pluginData,
- const char *dstPage, const char *srcPage);
-typedef const char *(*pluginShutdown) (void *pluginData);
-
-typedef struct
-{
- uint16 oldPageVersion; /* Page layout version of the old cluster */
- uint16 newPageVersion; /* Page layout version of the new cluster */
- uint16 pluginVersion; /* API version of converter plugin */
- void *pluginData; /* Plugin data (set by plugin) */
- pluginStartup startup; /* Pointer to plugin's startup function */
- pluginConvertFile convertFile; /* Pointer to plugin's file converter
- * function */
- pluginConvertPage convertPage; /* Pointer to plugin's page converter
- * function */
- pluginShutdown shutdown; /* Pointer to plugin's shutdown function */
-} pageCnvCtx;
-
-const pageCnvCtx *setupPageConverter(void);
-#else
-/* dummy */
-typedef void *pageCnvCtx;
-#endif
-
-const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst, bool force);
-const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
- const char *dst);
+const char *copyAndUpdateFile(const char *src, const char *dst, bool force);
+const char *linkAndUpdateFile(const char *src, const char *dst);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index c059c5b..fcaad79 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -15,10 +15,8 @@
#include "access/transam.h"
-static void transfer_single_new_db(pageCnvCtx *pageConverter,
- FileNameMap *maps, int size, char *old_tablespace);
-static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *suffix);
+static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
+static void transfer_relfile(FileNameMap *map, const char *suffix);
/*
@@ -92,7 +90,6 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
*new_db = NULL;
FileNameMap *mappings;
int n_maps;
- pageCnvCtx *pageConverter = NULL;
/*
* Advance past any databases that exist in the new cluster but not in
@@ -116,11 +113,7 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
{
print_maps(mappings, n_maps, new_db->db_name);
-#ifdef PAGE_CONVERSION
- pageConverter = setupPageConverter();
-#endif
- transfer_single_new_db(pageConverter, mappings, n_maps,
- old_tablespace);
+ transfer_single_new_db(mappings, n_maps, old_tablespace);
}
/* We allocate something even for n_maps == 0 */
pg_free(mappings);
@@ -129,45 +122,13 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
return;
}
-
-/*
- * get_pg_database_relfilenode()
- *
- * Retrieves the relfilenode for a few system-catalog tables. We need these
- * relfilenodes later in the upgrade process.
- */
-void
-get_pg_database_relfilenode(ClusterInfo *cluster)
-{
- PGconn *conn = connectToServer(cluster, "template1");
- PGresult *res;
- int i_relfile;
-
- res = executeQueryOrDie(conn,
- "SELECT c.relname, c.relfilenode "
- "FROM pg_catalog.pg_class c, "
- " pg_catalog.pg_namespace n "
- "WHERE c.relnamespace = n.oid AND "
- " n.nspname = 'pg_catalog' AND "
- " c.relname = 'pg_database' "
- "ORDER BY c.relname");
-
- i_relfile = PQfnumber(res, "relfilenode");
- cluster->pg_database_oid = atooid(PQgetvalue(res, 0, i_relfile));
-
- PQclear(res);
- PQfinish(conn);
-}
-
-
/*
* transfer_single_new_db()
*
* create links for mappings stored in "maps" array.
*/
static void
-transfer_single_new_db(pageCnvCtx *pageConverter,
- FileNameMap *maps, int size, char *old_tablespace)
+transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
{
int mapnum;
bool vm_crashsafe_match = true;
@@ -186,7 +147,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(pageConverter, &maps[mapnum], "");
+ transfer_relfile(&maps[mapnum], "");
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -194,9 +155,9 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
+ transfer_relfile(&maps[mapnum], "_fsm");
if (vm_crashsafe_match)
- transfer_relfile(pageConverter, &maps[mapnum], "_vm");
+ transfer_relfile(&maps[mapnum], "_vm");
}
}
}
@@ -209,8 +170,7 @@ transfer_single_new_db(pageCnvCtx *pageConverter,
* Copy or link file from old cluster to new one.
*/
static void
-transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
- const char *type_suffix)
+transfer_relfile(FileNameMap *map, const char *type_suffix)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -268,15 +228,11 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
/* Copying files might take some time, so give feedback. */
pg_log(PG_STATUS, "%s", old_file);
- if ((user_opts.transfer_mode == TRANSFER_MODE_LINK) && (pageConverter != NULL))
- pg_fatal("This upgrade requires page-by-page conversion, "
- "you must use copy mode instead of link mode.\n");
-
if (user_opts.transfer_mode == TRANSFER_MODE_COPY)
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
+ if ((msg = copyAndUpdateFile(old_file, new_file, true)) != NULL)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -284,7 +240,7 @@ transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
+ if ((msg = linkAndUpdateFile(old_file, new_file)) != NULL)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
002_pgupgrade_rewrite_vm_v37.patchapplication/x-patch; name=002_pgupgrade_rewrite_vm_v37.patchDownload
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 115d506..9adee01 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -9,7 +9,11 @@
#include "postgres_fe.h"
+#include "access/visibilitymap.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
#include <fcntl.h>
@@ -21,6 +25,25 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+/* table for fast rewriting vm file in order to add all-frozen information */
+static const uint16 rewrite_vm_table[256] = {
+ 0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
+ 256, 257, 260, 261, 272, 273, 276, 277, 320, 321, 324, 325, 336, 337, 340, 341,
+ 1024, 1025, 1028, 1029, 1040, 1041, 1044, 1045, 1088, 1089, 1092, 1093, 1104, 1105, 1108, 1109,
+ 1280, 1281, 1284, 1285, 1296, 1297, 1300, 1301, 1344, 1345, 1348, 1349, 1360, 1361, 1364, 1365,
+ 4096, 4097, 4100, 4101, 4112, 4113, 4116, 4117, 4160, 4161, 4164, 4165, 4176, 4177, 4180, 4181,
+ 4352, 4353, 4356, 4357, 4368, 4369, 4372, 4373, 4416, 4417, 4420, 4421, 4432, 4433, 4436, 4437,
+ 5120, 5121, 5124, 5125, 5136, 5137, 5140, 5141, 5184, 5185, 5188, 5189, 5200, 5201, 5204, 5205,
+ 5376, 5377, 5380, 5381, 5392, 5393, 5396, 5397, 5440, 5441, 5444, 5445, 5456, 5457, 5460, 5461,
+ 16384, 16385, 16388, 16389, 16400, 16401, 16404, 16405, 16448, 16449, 16452, 16453, 16464, 16465, 16468, 16469,
+ 16640, 16641, 16644, 16645, 16656, 16657, 16660, 16661, 16704, 16705, 16708, 16709, 16720, 16721, 16724, 16725,
+ 17408, 17409, 17412, 17413, 17424, 17425, 17428, 17429, 17472, 17473, 17476, 17477, 17488, 17489, 17492, 17493,
+ 17664, 17665, 17668, 17669, 17680, 17681, 17684, 17685, 17728, 17729, 17732, 17733, 17744, 17745, 17748, 17749,
+ 20480, 20481, 20484, 20485, 20496, 20497, 20500, 20501, 20544, 20545, 20548, 20549, 20560, 20561, 20564, 20565,
+ 20736, 20737, 20740, 20741, 20752, 20753, 20756, 20757, 20800, 20801, 20804, 20805, 20816, 20817, 20820, 20821,
+ 21504, 21505, 21508, 21509, 21520, 21521, 21524, 21525, 21568, 21569, 21572, 21573, 21584, 21585, 21588, 21589,
+ 21760, 21761, 21764, 21765, 21776, 21777, 21780, 21781, 21824, 21825, 21828, 21829, 21840, 21841, 21844, 21845
+};
/*
* copyAndUpdateFile()
@@ -138,6 +161,95 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilityMap()
+ *
+ * Copies a visibility map file while adding all-frozen bit(0) into each bit.
+ */
+const char *
+rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
+{
+ int src_fd = 0;
+ int dst_fd = 0;
+ char buffer[BLCKSZ];
+ ssize_t bytesRead;
+ int rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+ BlockNumber blkno = 0;
+
+ /* Reset errno */
+ errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return getErrorText();
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ /* Perform data rewriting per page */
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
+ char *cur, *end, *blkend;
+ PageHeaderData pageheader;
+ uint16 vm_bits;
+
+ /* Save the page header data */
+ memcpy(&pageheader, buffer, SizeOfPageHeaderData);
+
+ cur = buffer + SizeOfPageHeaderData;
+ end = buffer + SizeOfPageHeaderData + rewriteVmBytesPerPage;
+ blkend = buffer + bytesRead;
+
+ while (blkend >= end)
+ {
+ char vmbuf[BLCKSZ];
+ char *vmtmp = vmbuf;
+
+ /* Copy page header in advance */
+ memcpy(vmbuf, &pageheader, SizeOfPageHeaderData);
+
+ vmtmp += SizeOfPageHeaderData;
+
+ /* Rewrite visibility map bit one by one */
+ while (end > cur)
+ {
+ /* Write rewritten bit from table and its string representation */
+ vm_bits = rewrite_vm_table[(uint8) *cur];
+ memcpy(vmtmp, &vm_bits, BITS_PER_HEAPBLOCK);
+
+ cur++;
+ vmtmp += BITS_PER_HEAPBLOCK;
+ }
+
+ /* Set new checksum for a visibility map page, If enabled */
+ if (old_cluster.controldata.data_checksum_version != 0 &&
+ new_cluster.controldata.data_checksum_version != 0)
+ ((PageHeader) vmbuf)->pd_checksum = pg_checksum_page(vmbuf, blkno);
+
+ if (write(dst_fd, vmbuf, BLCKSZ) != BLCKSZ)
+ {
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ end += rewriteVmBytesPerPage;
+ blkno++;
+ }
+ }
+
+err:
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText();
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 900b2a7..ecd9ab3 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -110,6 +110,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of visibility map is changed with this 9.6 commit,
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201602181
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -269,6 +273,7 @@ typedef struct
uint32 major_version; /* PG_VERSION of cluster */
char major_version_str[64]; /* string PG_VERSION of cluster */
uint32 bin_version; /* version returned from pg_ctl */
+ Oid pg_database_oid; /* OID of pg_database relation */
const char *tablespace_suffix; /* directory specification */
} ClusterInfo;
@@ -365,6 +370,8 @@ bool pid_lock_file_exists(const char *datadir);
const char *copyAndUpdateFile(const char *src, const char *dst, bool force);
const char *linkAndUpdateFile(const char *src, const char *dst);
+const char *rewriteVisibilityMap(const char *fromfile, const char *tofile,
+ bool force);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index fcaad79..ee88c15 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -16,7 +16,7 @@
static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
-static void transfer_relfile(FileNameMap *map, const char *suffix);
+static void transfer_relfile(FileNameMap *map, const char *suffix, bool vm_need_rewrite);
/*
@@ -132,6 +132,7 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_need_rewrite = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -141,13 +142,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_need_rewrite = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(&maps[mapnum], "");
+ transfer_relfile(&maps[mapnum], "", vm_need_rewrite);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -155,9 +163,9 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(&maps[mapnum], "_fsm");
+ transfer_relfile(&maps[mapnum], "_fsm", vm_need_rewrite);
if (vm_crashsafe_match)
- transfer_relfile(&maps[mapnum], "_vm");
+ transfer_relfile(&maps[mapnum], "_vm", vm_need_rewrite);
}
}
}
@@ -168,9 +176,11 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
* transfer_relfile()
*
* Copy or link file from old cluster to new one.
+ * if vm_need_rewrite is true, visibility map is rewritten to be added frozen bit
+ * even link mode.
*/
static void
-transfer_relfile(FileNameMap *map, const char *type_suffix)
+transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_need_rewrite)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -232,7 +242,13 @@ transfer_relfile(FileNameMap *map, const char *type_suffix)
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyAndUpdateFile(old_file, new_file, true)) != NULL)
+ /* Rewrite visibility map */
+ if (vm_need_rewrite && (strcmp(type_suffix, "_vm") == 0))
+ msg = rewriteVisibilityMap(old_file, new_file, true);
+ else
+ msg = copyAndUpdateFile(old_file, new_file, true);
+
+ if (msg)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -240,7 +256,13 @@ transfer_relfile(FileNameMap *map, const char *type_suffix)
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkAndUpdateFile(old_file, new_file)) != NULL)
+ /* Rewrite visibility map even link mode */
+ if (vm_need_rewrite && (strcmp(type_suffix, "_vm") == 0))
+ msg = rewriteVisibilityMap(old_file, new_file, true);
+ else
+ msg = linkAndUpdateFile(old_file, new_file);
+
+ if (msg)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index ba79fb3..cd9b17e 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
On Tue, Mar 8, 2016 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Regarding the pg_visibility module, I'd like to share some bugs and
propose adding a relation type condition to each function.
OK, thanks.
Including it, I've attached the remaining 2 patches; one removes the page
conversion code from pg_upgrade, and the other adds pg_upgrade support
for the frozen bit.
Committed 001 with minor tweaks.
I find rewrite_vm_table to be pretty opaque. There's not even a
comment explaining what it is supposed to do. And I wonder why we
really need to be this efficient about it anyway. Like, would it be
too expensive to just do this:
    for (i = 0; i < BITS_PER_BYTE; ++i)
        if ((old & (1 << i)) != 0)
            new |= 1 << (2 * i);
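For illustration, here is a minimal, self-contained sketch of that loop-based
alternative (it is not code from any attached patch, and the helper name
rewrite_vm_byte is made up): each all-visible bit i of an old-format byte
moves to bit position 2 * i of a new-format uint16, and every all-frozen bit
stays clear, which is exactly what the rewrite_vm_table entries encode (for
example, table entry 255 is 21845 = 0x5555).

#include <stdint.h>
#include <stdio.h>

#define BITS_PER_BYTE 8

/* Expand one old-format VM byte (1 bit per heap page) into the new
 * format (2 bits per heap page), setting only the all-visible bits. */
static uint16_t
rewrite_vm_byte(uint8_t old)
{
    uint16_t new_bits = 0;
    int      i;

    for (i = 0; i < BITS_PER_BYTE; ++i)
        if ((old & (1 << i)) != 0)
            new_bits |= 1 << (2 * i);

    return new_bits;
}

int
main(void)
{
    /* 0xFF -> 21845 (0x5555), matching the last rewrite_vm_table entry;
     * 0x03 -> 5, matching rewrite_vm_table[3]. */
    printf("%u %u\n", rewrite_vm_byte(0xFF), rewrite_vm_byte(0x03));
    return 0;
}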
And how about adding some more comments explaining why we are doing
this rewriting, like this:
In versions of PostgreSQL prior to catversion 201602181, PostgreSQL's
visibility map included one bit per heap page; it now includes two.
When upgrading a cluster from before that time to a current PostgreSQL
version, we could refuse to copy visibility maps from the old cluster
to the new cluster; the next VACUUM would recreate them, but at the
price of scanning the entire table. So, instead, we rewrite the old
visibility maps in the new format. That way, the all-visible bit
remains set for the pages for which it was set previously. The
all-frozen bit is never set by this conversion; we leave that to
VACUUM.
Also, I'm slightly perplexed by the fact that I can't see how this
code succeeds in turning each page into two pages, which is something
that it seems like it would need to do. Wouldn't we need to write out
the old page header twice, one for the first of the two new pages and
again for the second? I probably need more caffeine here, so please
tell me what I'm missing.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Mar 8, 2016 at 8:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Mar 8, 2016 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Regarding the pg_visibility module, I'd like to share some bugs and
propose adding a relation type condition to each function.
OK, thanks.
I left out the relkind check from the final commit because, for one
thing, the check you added isn't actually right: toast relations can
also have a visibility map. And also, I'm sort of wondering what the
point of that check is. What does it protect us from? It doesn't
seem very future-proof ... what if we add a new relkind in the future?
Do we really want to have to update this?
How about instead changing things so that we specifically reject
indexes? And maybe some kind of a check that will reject anything
that lacks a relfilenode? That seems like it would be more on point.
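As an illustration of the kind of check being described here (a sketch only;
the function name, error wording, and the relfilenode test are assumptions,
and note that mapped catalogs such as pg_class legitimately have a zero
relfilenode, so a real check would need to treat them specially):

static void
check_relation_relkind(Relation rel)
{
    /* Reject indexes outright; they never have a visibility map. */
    if (rel->rd_rel->relkind == RELKIND_INDEX)
        ereport(ERROR,
                (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                 errmsg("\"%s\" is an index",
                        RelationGetRelationName(rel))));

    /* Reject anything with no on-disk storage of its own. */
    if (rel->rd_rel->relfilenode == InvalidOid)
        ereport(ERROR,
                (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                 errmsg("\"%s\" has no relfilenode",
                        RelationGetRelationName(rel))));
}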
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Mar 8, 2016 at 5:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Mar 8, 2016 at 7:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Regarding the pg_visibility module, I'd like to share some bugs and
propose adding a relation type condition to each function.
OK, thanks.
Including it, I've attached the remaining 2 patches; one removes the page
conversion code from pg_upgrade, and the other adds pg_upgrade support
for the frozen bit.
Committed 001 with minor tweaks.
I find rewrite_vm_table to be pretty opaque. There's not even a
comment explaining what it is supposed to do. And I wonder why we
really need to be this efficient about it anyway. Like, would it be
too expensive to just do this:
    for (i = 0; i < BITS_PER_BYTE; ++i)
        if ((old & (1 << i)) != 0)
            new |= 1 << (2 * i);
And how about adding some more comments explaining
this rewriting, like this:
In versions of PostgreSQL prior to catversion 201602181, PostgreSQL's
visibility map included one bit per heap page; it now includes two.
When upgrading a cluster from before that time to a current PostgreSQL
version, we could refuse to copy visibility maps from the old cluster
to the new cluster; the next VACUUM would recreate them, but at the
price of scanning the entire table. So, instead, we rewrite the old
visibility maps in the new format. That way, the all-visible bit
remains set for the pages for which it was set previously. The
all-frozen bit is never set by this conversion; we leave that to
VACUUM.
Also, I'm slightly perplexed by the fact that I can't see how this
code succeeds in turning each page into two pages, which is something
that it seems like it would need to do. Wouldn't we need to write out
the old page header twice, one for the first of the two new pages and
again for the second? I probably need more caffeine here, so please
tell me what I'm missing.
I think that this loop:
    while (blkend >= end)
executes exactly twice for each iteration of the outer loop. I'd
rather see it written as a loop which explicitly executes twice,
rather than looking like it might execute a dynamic number of times. I
can't imagine that this code needs to be future-proof. If we change
the format again in the future, surely we can't just change this code;
we would have to write new code for the new format.
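As a sketch of that suggestion (illustration only, with stand-in constants;
it reuses the hypothetical rewrite_vm_byte helper from the earlier sketch and
ignores checksums), the conversion could make the two output pages explicit:

#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192
#define PAGE_HEADER_SIZE 24     /* stand-in for SizeOfPageHeaderData */

extern uint16_t rewrite_vm_byte(uint8_t old);   /* per-byte expansion */

/* Rewrite one old-format VM page into exactly two new-format pages,
 * both reusing the old page header verbatim. */
static void
rewrite_vm_page(const char *oldpage, char newpages[2][BLCKSZ])
{
    int bytes_per_half = (BLCKSZ - PAGE_HEADER_SIZE) / 2;
    int half;

    for (half = 0; half < 2; half++)    /* exactly two output pages */
    {
        const uint8_t *src = (const uint8_t *) oldpage +
            PAGE_HEADER_SIZE + half * bytes_per_half;
        char *dst = newpages[half];
        int   i;

        /* Each new page starts with a copy of the old page header. */
        memcpy(dst, oldpage, PAGE_HEADER_SIZE);

        /* Every old byte becomes two bytes in the new page. */
        for (i = 0; i < bytes_per_half; i++)
        {
            uint16_t bits = rewrite_vm_byte(src[i]);

            memcpy(dst + PAGE_HEADER_SIZE + 2 * i, &bits, sizeof(bits));
        }
    }
}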
Cheers,
Jeff
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Mar 7, 2016 at 12:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Attached is the latest version of the optimisation patch.
I'm still considering the pg_upgrade regression test code, so I
will submit that patch later.
I just spent some time looking at this and I'm a bit worried about the
following (existing) comment in vacuumlazy.c:
* Note: The value returned by visibilitymap_get_status could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
* that the page is all-visible when in fact the flag's just been cleared,
* we might fail to vacuum the page. But it's OK to skip pages when
* scan_all is not set, so no great harm done; the next vacuum will find
* them. If we make the reverse mistake and vacuum a page unnecessarily,
* it'll just be a no-op.
The patch makes some attempt to update the comment mechanically, but
that's not nearly enough. That comment is explaining that you *can't*
rely on the visibility map to tell you *for sure* that a page does not
require vacuuming. For current uses, that's OK, because if we miss a
page we'll pick it up later. But now that we want to skip vacuuming pages
for relfrozenxid/relminmxid advancement, that rationale doesn't apply.
Missing pages that need to be frozen and advancing relfrozenxid anyway
would be _bad_.
However, after some further thought, I think we might actually be OK.
If a page goes from all-frozen to not-all-frozen while VACUUM is
running, any new XID added to the page must be newer than the
oldestXmin value computed by vacuum_set_xid_limits(), so it won't
affect the value to which we can safely set relfrozenxid. Similarly,
any MXID added to the page will be newer than GetOldestMultiXactId(),
so setting relminmxid is still safe for similar reasons.
I'd appreciate it if any other senior hackers could review that chain
of reasoning. It would be really bad to get this wrong.
On another note, I didn't really like the way you updated the
documentation. "eager freezing" doesn't seem like a great term to me,
and I think your changes were a little too localized. Here's a draft
alternative where I used the term "aggressive vacuum" to describe
freezing all of the pages except for those already known to be
all-frozen. Thoughts?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
all-frozen-doc.patchapplication/x-patch; name=all-frozen-doc.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a09ceb2..2f72633 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5984,12 +5984,15 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs an aggressive scan if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
- the age specified by this setting. The default is 150 million
- transactions. Although users can set this value anywhere from zero to
- two billions, <command>VACUUM</> will silently limit the effective value
- to 95% of <xref linkend="guc-autovacuum-freeze-max-age">, so that a
+ the age specified by this setting. An aggressive scan differs from
+ a regular <command>VACUUM</> in that it visits every page that might
+ contain unfrozen XIDs or MXIDs, not just those that might contain dead
+ tuples. The default is 150 million transactions. Although users can
+ set this value anywhere from zero to two billions, <command>VACUUM</>
+ will silently limit the effective value to 95% of
+ <xref linkend="guc-autovacuum-freeze-max-age">, so that a
periodical manual <command>VACUUM</> has a chance to run before an
anti-wraparound autovacuum is launched for the table. For more
information see
@@ -6028,9 +6031,12 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs an aggressive scan if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
- the age specified by this setting. The default is 150 million multixacts.
+ the age specified by this setting. An aggressive scan differs from
+ a regular <command>VACUUM</> in that it visits every page that might
+ contain unfrozen XIDs or MXIDs, not just those that might contain dead
+ tuples. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
<command>VACUUM</> will silently limit the effective value to 95% of
<xref linkend="guc-autovacuum-multixact-freeze-max-age">, so that a
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..d742ec9 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -438,22 +438,27 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
- <xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> uses the <link linkend="storage-vm">visibility map</>
+ to determine which pages of a relation must be scanned. Normally, it
+ will skips pages that don't have any dead row versions even if those pages
+ might still have row versions with old XID values. Therefore, normal
+ scans won't succeed in freezing every row version in the table.
+ Periodically, <command>VACUUM</> will perform an <firstterm>aggressive
+ vacuum</>, skipping only those pages which contain neither dead rows nor
+ any unfrozen XID or MXID values.
+ <xref linkend="guc-vacuum-freeze-table-age">
+ controls when <command>VACUUM</> does that: all-visible but not all-frozen
+ pages are scanned if the number of transactions that have passed since the
+ last such scan is greater than <varname>vacuum_freeze_table_age</> minus
+ <varname>vacuum_freeze_min_age</>. Setting
+ <varname>vacuum_freeze_table_age</> to 0 forces <command>VACUUM</> to
+ use this more aggressive strategy for all scans.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
+ the time of the last aggressive vacuum. If it were to go
unvacuumed for longer than
that, data loss could result. To ensure that this does not happen,
autovacuum is invoked on any table that might contain unfrozen rows with
@@ -491,7 +496,7 @@
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ frequent aggressive vaccuuming.
</para>
<para>
@@ -527,7 +532,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last aggressive <command>VACUUM</> for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -554,18 +559,21 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<para>
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
- <structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
+ <structfield>relfrozenxid</> can only be advanced every page of the table
+ that might contain unfrozen XIDs is scanned. This happens when
+ <structfield>relfrozenxid</> is more than
+ <varname>vacuum_freeze_table_age</> transactions old, when
+ <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all
+ pages that are not already all-frozen happen to
require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
+ scans every page in the table that is not already all-frozen, it should
+ set <literal>age(relfrozenxid)</> to a value just a little more than the
+ <varname>vacuum_freeze_min_age</> setting
that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ <command>VACUUM</> started). If no <structfield>relfrozenxid</>-advancing
+ <command>VACUUM</> is issued on the table until
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -634,21 +642,23 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
</para>
<para>
- During a <command>VACUUM</> table scan, either partial or of the whole
- table, any multixact ID older than
+ Whenever <command>VACUUM</> scans any part of a table, it will replace
+ any multixact ID it encounters which is older than
<xref linkend="guc-vacuum-multixact-freeze-min-age">
- is replaced by a different value, which can be the zero value, a single
+ by a different value, which can be the zero value, a single
transaction ID, or a newer multixact ID. For each table,
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
- scan is forced. <function>mxid_age()</> can be used on
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, an aggressive
+ vacuum is forced. As discussed in the previous section, an aggressive
+ vacuum means that only those pages which are known to be all-frozen will
+ be skipped. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
+ Aggressive <command>VACUUM</> scans, regardless of
what causes them, enable advancing the value for that table.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
@@ -656,13 +666,13 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
+ As a safety device, an aggressive vacuum scan will occur for any table
whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Aggressive
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
+ Both of these kinds of aggressive scans will occur even if autovacuum is
nominally disabled.
</para>
</sect3>
@@ -743,9 +753,9 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
- <structfield>relfrozenxid</>, otherwise only pages that have been modified
+ than <varname>vacuum_freeze_table_age</> transactions old, an aggressive
+ vacuum is performed to freeze old tuples and advance
+ <structfield>relfrozenxid</>; otherwise, only pages that have been modified
since the last vacuum are scanned.
</para>
Robert Haas <robertmhaas@gmail.com> writes:
The patch makes some attempt to update the comment mechanically, but
that's not nearly enough. That comment is explaining that you *can't*
rely on the visibility map to tell you *for sure* that a page does not
require vacuuming. For current uses, that's OK, because if we miss a
page we'll pick it up later. But now that we want to skip vacuuming pages
for relfrozenxid/relminmxid advancement, that rationale doesn't apply.
Missing pages that need to be frozen and advancing relfrozenxid anyway
would be _bad_.
Check.
However, after some further thought, I think we might actually be OK.
If a page goes from all-frozen to not-all-frozen while VACUUM is
running, any new XID added to the page must be newer than the
oldestXmin value computed by vacuum_set_xid_limits(), so it won't
affect the value to which we can safely set relfrozenxid. Similarly,
any MXID added to the page will be newer than GetOldestMultiXactId(),
so setting relminmxid is still safe for similar reasons.
Yeah, I agree with this, as long as the issue is only that the visibility
map result is slightly stale and not that it's, say, not crash-safe.
We can reasonably assume that any newly-added XID must be one that was
in progress while VACUUM was running, and hence will be after the xmin
horizon we computed earlier. This requires the existence of a read
barrier somewhere between computing xmin horizon and inspecting the
visibility map, but I find it hard to believe there aren't plenty.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Mar 9, 2016 at 1:23 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Tue, Mar 8, 2016 at 5:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I left out the relkind check from the final commit because, for one
thing, the check you added isn't actually right: toast relations can
also have a visibility map. And also, I'm sort of wondering what the
point of that check is. What does it protect us from? It doesn't
seem very future-proof ... what if we add a new relkind in the future?
Do we really want to have to update this?
How about instead changing things so that we specifically reject
indexes? And maybe some kind of a check that will reject anything
that lacks a relfilenode? That seems like it would be more on point.
I agree; I don't have a strong opinion about this.
It would be good to add a condition for rejecting only indexes.
The attached patches are:
- Change heap2 rmgr description
- Add condition to pg_visibility
- Fix typo in pgvisibility.sgml
(Sorry for the late notice..)
Regards,
--
Masahiko Sawada
Attachments:
fix_typo.patchapplication/x-patch; name=fix_typo.patchDownload
diff --git a/doc/src/sgml/pgvisibility.sgml b/doc/src/sgml/pgvisibility.sgml
index 6a98b55..cdd6a6f 100644
--- a/doc/src/sgml/pgvisibility.sgml
+++ b/doc/src/sgml/pgvisibility.sgml
@@ -21,7 +21,7 @@
until such time as a tuple is inserted, updated, deleted, or locked on
that page. The page-level <literal>PD_ALL_VISIBLE</literal> bit has the
same meaning as the all-visible bit in the visibility map, but is stored
- within the data page itself rather than a separate data tructure. These
+ within the data page itself rather than a separate data structure. These
will normally agree, but the page-level bit can sometimes be set while the
visibility map bit is clear after a crash recovery; or they can disagree
because of a change which occurs after <literal>pg_visibility</> examines
add_condition_to_pgvisibility.patchapplication/x-patch; name=add_condition_to_pgvisibility.patchDownload
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 5e5c7cc..c916d0d 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -56,6 +56,13 @@ pg_visibility_map(PG_FUNCTION_ARGS)
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("invalid block number")));
+ /* Check for relation type */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("cannot use for index \"%s\"",
+ RelationGetRelationName(rel))));
+
tupdesc = pg_visibility_tupdesc(false, false);
MemSet(nulls, 0, sizeof(nulls));
@@ -95,6 +102,13 @@ pg_visibility(PG_FUNCTION_ARGS)
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("invalid block number")));
+ /* Check for relation type */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("cannot use for index \"%s\"",
+ RelationGetRelationName(rel))));
+
tupdesc = pg_visibility_tupdesc(false, true);
MemSet(nulls, 0, sizeof(nulls));
@@ -226,6 +240,13 @@ pg_visibility_map_summary(PG_FUNCTION_ARGS)
rel = relation_open(relid, AccessShareLock);
nblocks = RelationGetNumberOfBlocks(rel);
+ /* Check for relation type */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("cannot use for index \"%s\"",
+ RelationGetRelationName(rel))));
+
for (blkno = 0; blkno < nblocks; ++blkno)
{
int32 mapbits;
@@ -300,6 +321,13 @@ collect_visibility_data(Oid relid, bool include_pd)
rel = relation_open(relid, AccessShareLock);
+ /* Check for relation type */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("cannot use for index \"%s\"",
+ RelationGetRelationName(rel))));
+
nblocks = RelationGetNumberOfBlocks(rel);
info = palloc0(offsetof(vbits, bits) + nblocks);
info->next = 0;
add_flags_to_heap2_desc.patchapplication/x-patch; name=add_flags_to_heap2_desc.patchDownload
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index a63162c..2b31ea4 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -125,7 +125,8 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
{
xl_heap_visible *xlrec = (xl_heap_visible *) rec;
- appendStringInfo(buf, "cutoff xid %u", xlrec->cutoff_xid);
+ appendStringInfo(buf, "cutoff xid %u flags %d",
+ xlrec->cutoff_xid, xlrec->flags);
}
else if (info == XLOG_HEAP2_MULTI_INSERT)
{
On Tue, Mar 8, 2016 at 12:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
However, after some further thought, I think we might actually be OK.
If a page goes from all-frozen to not-all-frozen while VACUUM is
running, any new XID added to the page must be newer than the
oldestXmin value computed by vacuum_set_xid_limits(), so it won't
affect the value to which we can safely set relfrozenxid. Similarly,
any MXID added to the page will be newer than GetOldestMultiXactId(),
so setting relminmxid is still safe for similar reasons.
Yeah, I agree with this, as long as the issue is only that the visibility
map result is slightly stale and not that it's, say, not crash-safe.
If the visibility map isn't crash safe, we've got big problems even
without this patch, but we dealt with that when index-only scans went
in. Maybe this patch introduces more stringent requirements in this
area, but I can't think of any reason why that should be true. If
anything occurs to you (or anyone else), it would be good to mention
that before I go further destroy the world.
We can reasonably assume that any newly-added XID must be one that was
in progress while VACUUM was running, and hence will be after the xmin
horizon we computed earlier. This requires the existence of a read
barrier somewhere between computing xmin horizon and inspecting the
visibility map, but I find it hard to believe there aren't plenty.
I'll check that, but I agree that it should be OK.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Mar 8, 2016 at 12:59 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
How about instead changing things so that we specifically reject
indexes? And maybe some kind of a check that will reject anything
that lacks a relfilenode? That seems like it would be more on point.
I agree; I don't have a strong opinion about this.
It would be good to add a condition for rejecting only indexes.
The attached patches are:
- Change heap2 rmgr description
- Add condition to pg_visibility
- Fix typo in pgvisibility.sgml
(Sorry for the late notice..)
OK, committed the first and last of those. I think the other one
needs some work yet; the error message doesn't seem like it is quite
our usual style, and if we're going to do something here we should
probably also insert a check to throw a better error when there is no
relfilenode.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Mar 9, 2016 at 3:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Mar 8, 2016 at 12:59 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
How about instead changing things so that we specifically reject
indexes? And maybe some kind of a check that will reject anything
that lacks a relfilenode? That seems like it would be more on point.
I agree; I don't have a strong opinion about this.
It would be good to add a condition for rejecting only indexes.
The attached patches are:
- Change heap2 rmgr description
- Add condition to pg_visibility
- Fix typo in pgvisibility.sgml
(Sorry for the late notice..)
OK, committed the first and last of those. I think the other one
needs some work yet; the error message doesn't seem like it is quite
our usual style, and if we're going to do something here we should
probably also insert a check to throw a better error when there is no
relfilenode.
Thank you for your advice and suggestions!
Attached are the latest 2 patches.
* 000 patch : Incorporated the review comments and made the rewriting
logic clearer.
* 001 patch : Incorporated the documentation suggestions and updated
the logic a little.
Please review them.
Regards,
--
Masahiko Sawada
Attachments:
000_pgupgrade_rewrites_vm_v38.patch (application/octet-stream)
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 2a99a28..112a10c 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -9,7 +9,11 @@
#include "postgres_fe.h"
+#include "access/visibilitymap.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
#include <fcntl.h>
@@ -138,6 +142,132 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilityMap()
+ *
+ * In versions of PostgreSQL prior to catversion 201602181, PostgreSQL's
+ * visibility map included one bit per heap page; it now includes two.
+ * When upgrading a cluster from before that time to a current PostgreSQL
+ * version, we could refuse to copy visibility maps from the old cluster
+ * to the new cluster; the next VACUUM would recreate them, but at the
+ * price of scanning the entire table. So, instead, we rewrite the old
+ * visibility maps in the new format. That way, the all-visible bit
+ * remains set for the pages for which it was set previously. The
+ * all-frozen bit is never set by this conversion; we leave that to
+ * VACUUM.
+ */
+const char *
+rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
+{
+#define BITS_PER_HEAPBLOCK_OLD 1
+
+ int src_fd = 0;
+ int dst_fd = 0;
+ char buffer[BLCKSZ];
+ ssize_t bytesRead;
+ int rewriteVmBytesPerPage;
+ BlockNumber blkno = 0;
+
+ /* Compute how many old page bytes we need to rewrite one new page */
+ rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+
+ /* Reset errno */
+ errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return getErrorText();
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ /*
+ * Turn each old visibility map page into two new pages, one page at a time.
+ * Each rewritten page gets the same page header the old page had.
+ */
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
+ char *cur_old, *break_old, *blkend_old;
+ PageHeaderData pageheader;
+
+ /* Save the page header data */
+ memcpy(&pageheader, buffer, SizeOfPageHeaderData);
+
+ /*
+ * These old_* variables point to old visibility map page.
+ *
+ * cur_old : Points to current position on old page.
+ * blkend_old : Points to end of old block.
+ * break_old : Points to old page break position for rewriting
+ * a new page. After wrote a new page, old_end
+ * proceeds rewriteVmBytesPerPgae bytes.
+ */
+ cur_old = buffer + SizeOfPageHeaderData;
+ blkend_old = buffer + bytesRead;
+ break_old = cur_old + rewriteVmBytesPerPage;
+
+ while (blkend_old >= break_old)
+ {
+ char vmbuf[BLCKSZ];
+ char *cur_new = vmbuf;
+
+ /* Copy page header in advance */
+ memcpy(vmbuf, &pageheader, SizeOfPageHeaderData);
+
+ cur_new += SizeOfPageHeaderData;
+
+ /*
+ * Process old page bytes one by one, and turn it
+ * into new page.
+ */
+ while (break_old > cur_old)
+ {
+ uint16 new_vmbits = 0;
+ int i;
+
+ /* Generate new format bits while keeping old information */
+ for (i = 0; i < BITS_PER_BYTE; i++)
+ {
+ if ((((uint8) *cur_old) & (1 << i)) != 0)
+ new_vmbits |= 1 << (BITS_PER_HEAPBLOCK * i);
+ }
+
+ /* Copy new visibility map bit to new format page */
+ memcpy(cur_new, &new_vmbits, BITS_PER_HEAPBLOCK);
+
+ cur_old += BITS_PER_HEAPBLOCK_OLD;
+ cur_new += BITS_PER_HEAPBLOCK;
+ }
+
+ /* Set new checksum for a visibility map page, if enabled */
+ if (old_cluster.controldata.data_checksum_version != 0 &&
+ new_cluster.controldata.data_checksum_version != 0)
+ ((PageHeader) vmbuf)->pd_checksum = pg_checksum_page(vmbuf, blkno);
+
+ if (write(dst_fd, vmbuf, BLCKSZ) != BLCKSZ)
+ {
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ break_old += rewriteVmBytesPerPage;
+ blkno++;
+ }
+ }
+
+err:
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText();
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 6122878..5e72985 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -110,6 +110,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map was changed by this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201602181
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -269,6 +273,7 @@ typedef struct
uint32 major_version; /* PG_VERSION of cluster */
char major_version_str[64]; /* string PG_VERSION of cluster */
uint32 bin_version; /* version returned from pg_ctl */
+ Oid pg_database_oid; /* OID of pg_database relation */
const char *tablespace_suffix; /* directory specification */
} ClusterInfo;
@@ -365,6 +370,8 @@ bool pid_lock_file_exists(const char *datadir);
const char *copyFile(const char *src, const char *dst, bool force);
const char *linkFile(const char *src, const char *dst);
+const char *rewriteVisibilityMap(const char *fromfile, const char *tofile,
+ bool force);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index b20f073..24491c1 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -16,7 +16,7 @@
static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
-static void transfer_relfile(FileNameMap *map, const char *suffix);
+static void transfer_relfile(FileNameMap *map, const char *suffix, bool vm_need_rewrite);
/*
@@ -132,6 +132,7 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_need_rewrite = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -141,13 +142,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_need_rewrite = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(&maps[mapnum], "");
+ transfer_relfile(&maps[mapnum], "", vm_need_rewrite);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -155,9 +163,9 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(&maps[mapnum], "_fsm");
+ transfer_relfile(&maps[mapnum], "_fsm", vm_need_rewrite);
if (vm_crashsafe_match)
- transfer_relfile(&maps[mapnum], "_vm");
+ transfer_relfile(&maps[mapnum], "_vm", vm_need_rewrite);
}
}
}
@@ -168,9 +176,11 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
* transfer_relfile()
*
* Copy or link file from old cluster to new one.
+ * If vm_need_rewrite is true, the visibility map is rewritten to add the frozen bit,
+ * even in link mode.
*/
static void
-transfer_relfile(FileNameMap *map, const char *type_suffix)
+transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_need_rewrite)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -232,7 +242,13 @@ transfer_relfile(FileNameMap *map, const char *type_suffix)
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyFile(old_file, new_file, true)) != NULL)
+ /* Rewrite visibility map if needed */
+ if (vm_need_rewrite && (strcmp(type_suffix, "_vm") == 0))
+ msg = rewriteVisibilityMap(old_file, new_file, true);
+ else
+ msg = copyFile(old_file, new_file, true);
+
+ if (msg)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -240,7 +256,13 @@ transfer_relfile(FileNameMap *map, const char *type_suffix)
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkFile(old_file, new_file)) != NULL)
+ /* Rewrite visibility map if needed */
+ if (vm_need_rewrite && (strcmp(type_suffix, "_vm") == 0))
+ msg = rewriteVisibilityMap(old_file, new_file, true);
+ else
+ msg = linkFile(old_file, new_file);
+
+ if (msg)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index ba79fb3..cd9b17e 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,11 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ # Test for rewriting visibility map
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +193,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +216,8 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+vacuumdb -d regression || visibilitymap_vacuum2_status=$?
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +229,26 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
001_optimize_vacuum_by_frozen_bit_v38.patch (application/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a09ceb2..2f72633 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5984,12 +5984,15 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs an aggressive scan if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
- the age specified by this setting. The default is 150 million
- transactions. Although users can set this value anywhere from zero to
- two billions, <command>VACUUM</> will silently limit the effective value
- to 95% of <xref linkend="guc-autovacuum-freeze-max-age">, so that a
+ the age specified by this setting. An aggressive scan differs from
+ a regular <command>VACUUM</> in that it visits every page that might
+ contain unfrozen XIDs or MXIDs, not just those that might contain dead
+ tuples. The default is 150 million transactions. Although users can
+ set this value anywhere from zero to two billions, <command>VACUUM</>
+ will silently limit the effective value to 95% of
+ <xref linkend="guc-autovacuum-freeze-max-age">, so that a
periodical manual <command>VACUUM</> has a chance to run before an
anti-wraparound autovacuum is launched for the table. For more
information see
@@ -6028,9 +6031,12 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs an aggressive scan if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
- the age specified by this setting. The default is 150 million multixacts.
+ the age specified by this setting. An aggressive scan differs from
+ a regular <command>VACUUM</> in that it visits every page that might
+ contain unfrozen XIDs or MXIDs, not just those that might contain dead
+ tuples. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
<command>VACUUM</> will silently limit the effective value to 95% of
<xref linkend="guc-autovacuum-multixact-freeze-max-age">, so that a
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..a03c2c6 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -438,27 +438,32 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
- <xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> uses the <link linkend="storage-vm">visibility map</>
+ to determine which pages of a relation must be scanned. Normally, it
+ will skip pages that don't have any dead row versions even if those pages
+ might still have row versions with old XID values. Therefore, normal
+ scans won't succeed in freezing every row version in the table.
+ Periodically, <command>VACUUM</> will perform an <firstterm>aggressive
+ vacuum</>, skipping only those pages which contain neither dead rows nor
+ any unfrozen XID or MXID values.
+ <xref linkend="guc-vacuum-freeze-table-age">
+ controls when <command>VACUUM</> does that: all-visible but not all-frozen
+ pages are scanned if the number of transactions that have passed since the
+ last such scan is greater than <varname>vacuum_freeze_table_age</> minus
+ <varname>vacuum_freeze_min_age</>. Setting
+ <varname>vacuum_freeze_table_age</> to 0 forces <command>VACUUM</> to
+ use this more aggressive strategy for all scans.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
+ the time of the last aggressive vacuum. If it were to go
unvacuumed for longer than
that, data loss could result. To ensure that this does not happen,
autovacuum is invoked on any table that might contain unfrozen rows with
XIDs older than the age specified by the configuration parameter <xref
- linkend="guc-autovacuum-freeze-max-age">. (This will happen even if
+ linkend="guc-autovacuum-freeze-max-age">. (This will happen even if
autovacuum is disabled.)
</para>
@@ -491,7 +496,7 @@
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ frequent aggressive vacuuming.
</para>
<para>
@@ -527,7 +532,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last aggressive <command>VACUUM</> for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -554,18 +559,21 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<para>
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
- <structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
+ <structfield>relfrozenxid</> can only be advanced when every page of the table
+ that might contain unfrozen XIDs is scanned. This happens when
+ <structfield>relfrozenxid</> is more than
+ <varname>vacuum_freeze_table_age</> transactions old, when
+ <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all
+ pages that are not already all-frozen happen to
require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ scans every page in the table that is not already all-frozen, it should
+ set <literal>age(relfrozenxid)</> to a value just a little more than the
+ <varname>vacuum_freeze_min_age</> setting
+ that was used (more by the number of transactions started since the
+ <command>VACUUM</> started). If no <structfield>relfrozenxid</>-advancing
+ <command>VACUUM</> is issued on the table until
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -634,21 +642,23 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
</para>
<para>
- During a <command>VACUUM</> table scan, either partial or of the whole
- table, any multixact ID older than
+ Whenever <command>VACUUM</> scans any part of a table, it will replace
+ any multixact ID it encounters which is older than
<xref linkend="guc-vacuum-multixact-freeze-min-age">
- is replaced by a different value, which can be the zero value, a single
+ by a different value, which can be the zero value, a single
transaction ID, or a newer multixact ID. For each table,
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
- scan is forced. <function>mxid_age()</> can be used on
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, an aggressive
+ vacuum is forced. As discussed in the previous section, an aggressive
+ vacuum means that only those pages which are known to be all-frozen will
+ be skipped. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
+ Aggressive <command>VACUUM</> scans, regardless of
what causes them, enable advancing the value for that table.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
@@ -656,13 +666,13 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
+ As a safety device, an aggressive vacuum scan will occur for any table
whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Aggressive
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
+ Both of these kinds of aggressive scans will occur even if autovacuum is
nominally disabled.
</para>
</sect3>
@@ -743,9 +753,9 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
- <structfield>relfrozenxid</>, otherwise only pages that have been modified
+ than <varname>vacuum_freeze_table_age</> transactions old, an aggressive
+ vacuum is performed to freeze old tuples and advance
+ <structfield>relfrozenxid</>; otherwise, only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 363b2d0..401e218 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,7 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber vmskipped_frozen_pages; /* # of pages we skipped by all-frozen bit */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -221,7 +222,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* We request a full scan if either the table's frozen Xid is now older
* than or equal to the requested Xid full-table scan limit; or if the
* table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit.
+ * mxid full-table scan limit. During a full scan, we can still skip
+ * pages according to the all-frozen bit of the visibility map.
*/
scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
@@ -253,7 +255,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->vmskipped_frozen_pages)
+ < vacrelstats->rel_pages)
{
Assert(!scan_all);
scanned_all = false;
@@ -274,9 +277,9 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* Update statistics in pg_class.
*
* A corner case here is that if we scanned no pages at all because every
- * page is all-visible, we should not update relpages/reltuples, because
- * we have no new information to contribute. In particular this keeps us
- * from replacing relpages=reltuples=0 (which means "unknown tuple
+ * page is all-visible or all-frozen, we should not update relpages/reltuples,
+ * because we have no new information to contribute. In particular this keeps
+ * us from replacing relpages=reltuples=0 (which means "unknown tuple
* density") with nonzero relpages and reltuples=0 (which means "zero
* tuple density") unless there's some actual evidence for the latter.
*
@@ -354,10 +357,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped according to vm\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->vmskipped_frozen_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -480,9 +484,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* consecutive pages. Since we're reading sequentially, the OS should be
* doing readahead for us, so there's no gain in skipping a page now and
* then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Also, skipping even a single page according to the all-visible bit of the
+ * visibility map means that we might not be able to update relfrozenxid,
+ * so we only want to do it if we can skip a goodly number. On the other hand,
+ * we count both how many pages we skipped according to the all-frozen bit and
+ * how many pages we froze, so we can update relfrozenxid if the sum of the two
+ * equals the number of pages in the table.
*
* Before entering the main loop, establish the invariant that
* next_not_all_visible_block is the next block number >= blkno that's not
@@ -492,18 +499,18 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* started skipping blocks, we may as well skip everything up to the next
* not-all-visible block.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
+ * Note: if scan_all is true, we might not actually skip any pages; but we
* maintain next_not_all_visible_block anyway, so as to set up the
* all_visible_according_to_vm flag correctly for each page.
*
* Note: The value returned by visibilitymap_get_status could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible/all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. But it's OK to skip
+ * pages when scan_all is not set, so no great harm done; the next vacuum
+ * will find them. If we make the reverse mistake and vacuum a page
+ * unnecessarily, it'll just be a no-op.
*
* We will scan the table's last page, at least to the extent of
* determining whether it has tuples or not, even if it should be skipped
@@ -536,9 +543,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- int nfrozen;
+ int nfrozen; /* # of tuples frozen by this scan */
+ int nalready_frozen; /* # of tuples already frozen */
+ int ntotal_frozen; /* total # of frozen tuples on a single page */
+ int ntup_per_page;
Size freespace;
bool all_visible_according_to_vm;
+ bool all_frozen_according_to_vm;
bool all_visible;
bool all_frozen = true; /* provided all_visible is also true */
bool has_dead_tuples;
@@ -570,13 +581,27 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
skipping_all_visible_blocks = false;
all_visible_according_to_vm = false;
+ all_frozen_according_to_vm = false;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all && !FORCE_CHECK_PAGE())
+ /*
+ * This block is at least all-visible according to visibility map.
+ * We check whether this block is all-frozen, so that we can skip
+ * scanning this page even when scan_all is true.
+ */
+ bool all_frozen = VM_ALL_FROZEN(onerel, blkno, &vmbuffer);
+
+ if (scan_all && all_frozen && !FORCE_CHECK_PAGE())
+ {
+ vacrelstats->vmskipped_frozen_pages++;
continue;
+ }
+ else if (skipping_all_visible_blocks && !scan_all && !FORCE_CHECK_PAGE())
+ continue;
+
all_visible_according_to_vm = true;
+ all_frozen_according_to_vm = all_frozen;
}
vacuum_delay_point();
@@ -792,6 +817,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
all_visible = true;
has_dead_tuples = false;
nfrozen = 0;
+ nalready_frozen = 0;
+ ntup_per_page = 0;
hastup = false;
prev_dead_count = vacrelstats->num_dead_tuples;
maxoff = PageGetMaxOffsetNumber(page);
@@ -946,6 +973,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else
{
num_tuples += 1;
+ ntup_per_page += 1;
hastup = true;
/*
@@ -996,6 +1024,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
END_CRIT_SECTION();
}
+ /* Compute total number of frozen tuples in single page */
+ ntotal_frozen = nfrozen + nalready_frozen;
+
/*
* If there are no indexes then we can vacuum the page right now
* instead of doing a second scan.
@@ -1018,31 +1049,45 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
freespace = PageGetHeapFreeSpace(page);
- /* mark page all-visible, if appropriate */
- if (all_visible && !all_visible_according_to_vm)
+ /* If this page is all visible, consider setting all-visible and all-frozen */
+ if (all_visible)
{
- uint8 flags = VISIBILITYMAP_ALL_VISIBLE;
+ uint8 flags = 0;
+ /* mark page all-visible, if appropriate */
+ if (!all_visible_according_to_vm)
+ {
+ /*
+ * It should never be the case that the visibility map page is set
+ * while the page-level bit is clear, but the reverse is allowed
+ * (if checksums are not enabled). Regardless, set the both bits
+ * so that we get back in sync.
+ *
+ * NB: If the heap page is all-visible but the VM bit is not set,
+ * we don't need to dirty the heap page. However, if checksums
+ * are enabled, we do need to make sure that the heap page is
+ * dirtied before passing it to visibilitymap_set(), because it
+ * may be logged. Given that this situation should only happen in
+ * rare cases after a crash, it is not worth optimizing.
+ */
+ PageSetAllVisible(page);
+ flags |= VISIBILITYMAP_ALL_VISIBLE;
+ }
- if (all_frozen)
+ /* Mark page as all-frozen, if all tuples are frozen and not marked yet */
+ if ((all_frozen || (ntotal_frozen == ntup_per_page)) &&
+ !all_frozen_according_to_vm)
+ {
+ PageSetAllFrozen(page);
flags |= VISIBILITYMAP_ALL_FROZEN;
+ }
- /*
- * It should never be the case that the visibility map page is set
- * while the page-level bit is clear, but the reverse is allowed
- * (if checksums are not enabled). Regardless, set the both bits
- * so that we get back in sync.
- *
- * NB: If the heap page is all-visible but the VM bit is not set,
- * we don't need to dirty the heap page. However, if checksums
- * are enabled, we do need to make sure that the heap page is
- * dirtied before passing it to visibilitymap_set(), because it
- * may be logged. Given that this situation should only happen in
- * rare cases after a crash, it is not worth optimizing.
- */
- PageSetAllVisible(page);
- MarkBufferDirty(buf);
- visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, visibility_cutoff_xid, flags);
+
+ if (flags)
+ {
+ MarkBufferDirty(buf);
+ visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, visibility_cutoff_xid, flags);
+ }
}
/*
@@ -1055,7 +1100,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
&& VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
{
- elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+ /* If the all-frozen bit is set then all-visible bit must be set */
+ if (all_frozen_according_to_vm)
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page is not marked all-visible (and all-frozen) but visibility map bit is set in relation \"%s\" page %u",
relname, blkno);
visibilitymap_clear(onerel, blkno, vmbuffer);
}
@@ -1063,19 +1113,25 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/*
* It's possible for the value returned by GetOldestXmin() to move
* backwards, so it's not wrong for us to see tuples that appear to
- * not be visible to everyone yet, while PD_ALL_VISIBLE is already
- * set. The real safe xmin value never moves backwards, but
+ * not be visible to everyone yet, while PD_ALL_VISIBLE (and PD_ALL_FROZEN)
+ * are already set. The real safe xmin value never moves backwards, but
* GetOldestXmin() is conservative and sometimes returns a value
* that's unnecessarily small, so if we see that contradiction it just
* means that the tuples that we think are not visible to everyone yet
- * actually are, and the PD_ALL_VISIBLE flag is correct.
+ * actually are, and the PD_ALL_VISIBLE (and PD_ALL_FROZEN) flag are
+ * correct.
*
* There should never be dead tuples on a page with PD_ALL_VISIBLE
* set, however.
*/
else if (PageIsAllVisible(page) && has_dead_tuples)
{
- elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+ /* If the all-frozen bit is set then all-visible bit must be set */
+ if (all_frozen_according_to_vm)
+ Assert(VM_ALL_FROZEN(onerel, blkno, &vmbuffer) &&
+ VM_ALL_VISIBLE(onerel, blkno, &vmbuffer));
+
+ elog(WARNING, "page containing dead tuples is marked as all-visible (and all-frozen) in relation \"%s\" page %u",
relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
@@ -1167,6 +1223,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %d frozen page according to visibility map",
+ "skipped %d frozen pages according to visibility map",
+ vacrelstats->vmskipped_frozen_pages,
+ vacrelstats->vmskipped_frozen_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..767a0ec
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,15 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+VACUUM FREEZE vmtest;
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 44 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 56 nonremovable row versions in 1 out of 45 pages
+\set VERBOSITY default
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index bec0316..9ad2ffc 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -110,3 +110,6 @@ test: event_trigger
# run stats by itself because its delay may be insufficient under heavy load
test: stats
+
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
\ No newline at end of file
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 7e9b319..4b4eb07 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -162,3 +162,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
\ No newline at end of file
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..fb9c811
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,13 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+VACUUM FREEZE vmtest;
+
+-- Check whether vacuum skips all-frozen pages
+\set VERBOSITY terse
+VACUUM FREEZE VERBOSE vmtest;
+\set VERBOSITY default
+
+DROP TABLE vmtest;
On Wed, Mar 9, 2016 at 9:09 AM, Masahiko Sawada
<sawada.mshk@gmail.com> wrote:
Attached latest 2 patches.
* 000 patch : Incorporated the review comments and made the rewriting
logic clearer.
That's better, thanks. But your comments don't survive pgindent.
After running pgindent, I get this:
+ /*
+ * These old_* variables point to old visibility map page.
+ *
+ * cur_old : Points to current position on old
page. blkend_old :
+ * Points to end of old block. break_old : Points to
old page break
+ * position for rewriting a new page. After wrote a
new page, old_end
+ * proceeds rewriteVmBytesPerPgae bytes.
+ */
You need to either surround this sort of thing with dashes to make
pgindent ignore it, or, probably better, rewrite it using complete
sentences that together form a paragraph.
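For reference, the dash convention looks like this: a comment block opened with
a run of dashes is left untouched by pgindent, so the column-style layout
survives. A hypothetical rewrite of the comment above, not taken from either
patch:

    /*----------
     * These old_* variables point into the old visibility map page:
     *    cur_old    - current position on the old page
     *    blkend_old - end of the old block
     *    break_old  - position at which we cut off and emit one new page;
     *                 after each new page is written it advances by
     *                 rewriteVmBytesPerPage bytes
     *----------
     */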
+ Oid pg_database_oid; /* OID of
pg_database relation */
Not used anywhere?
Instead of vm_need_rewrite, how about vm_must_add_frozenbit?
Can you explain the changes to test.sh?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thank you for reviewing!
Attached updated patch.
On Thu, Mar 10, 2016 at 3:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Mar 9, 2016 at 9:09 AM, Masahiko Sawada
<sawada.mshk@gmail.com> wrote:
Attached latest 2 patches.
* 000 patch : Incorporated the review comments and made the rewriting
logic clearer.
That's better, thanks. But your comments don't survive pgindent.
After running pgindent, I get this:
+ /*
+ * These old_* variables point to old visibility map page.
+ *
+ * cur_old : Points to current position on old page. blkend_old :
+ * Points to end of old block. break_old : Points to old page break
+ * position for rewriting a new page. After wrote a new page, old_end
+ * proceeds rewriteVmBytesPerPgae bytes.
+ */
You need to either surround this sort of thing with dashes to make
pgindent ignore it, or, probably better, rewrite it using complete
sentences that together form a paragraph.
Fixed.
+ Oid pg_database_oid; /* OID of pg_database relation */
Not used anywhere?
Fixed.
Instead of vm_need_rewrite, how about vm_must_add_frozenbit?
Fixed.
Can you explain the changes to test.sh?
The current regression test scenario is:
1. Do 'make check' on pre-upgrade cluster
2. Dump the relallvisible values of all relations in the pre-upgrade cluster to
vm_test1.txt
3. Do pg_upgrade
4. Do analyze (not vacuum), dump the relallvisible values of all relations
in the post-upgrade cluster to vm_test2.txt
5. Compare between vm_test1.txt and vm_test2.txt
That is, the regression test compares the relallvisible values of the
pre-upgrade cluster with those of the post-upgrade cluster.
But because test.sh always uses pre/post clusters with the same catalog
version, I realized that we cannot ensure that the visibility map
rewriting is processed successfully in the test.sh framework.
The visibility map rewriting is never actually executed.
We might need another framework for testing visibility map page rewriting.
Regards,
--
Masahiko Sawada
Attachments:
000_pgupgrade_rewrite_vm_v39.patch (application/x-patch)
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 2a99a28..7c5bfa6 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -9,7 +9,11 @@
#include "postgres_fe.h"
+#include "access/visibilitymap.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
#include <fcntl.h>
@@ -138,6 +142,130 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilityMap()
+ *
+ * In versions of PostgreSQL prior to catversion 201602181, PostgreSQL's
+ * visibility map included one bit per heap page; it now includes two.
+ * When upgrading a cluster from before that time to a current PostgreSQL
+ * version, we could refuse to copy visibility maps from the old cluster
+ * to the new cluster; the next VACUUM would recreate them, but at the
+ * price of scanning the entire table. So, instead, we rewrite the old
+ * visibility maps in the new format. That way, the all-visible bit
+ * remains set for the pages for which it was set previously. The
+ * all-frozen bit is never set by this conversion; we leave that to
+ * VACUUM.
+ */
+const char *
+rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
+{
+#define BITS_PER_HEAPBLOCK_OLD 1
+
+ int src_fd = 0;
+ int dst_fd = 0;
+ char buffer[BLCKSZ];
+ ssize_t bytesRead;
+ int rewriteVmBytesPerPage;
+ BlockNumber blkno = 0;
+
+ /* Compute how many old page bytes we need to rewrite one new page */
+ rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+
+ /* Reset errno */
+ errno = 0;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return getErrorText();
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ goto err;
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ goto err;
+
+ /*
+ * Turn each old visibility map page into two new pages, one page at a time.
+ * Each rewritten page gets the same page header the old page had.
+ */
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
+ char *cur_old, *break_old, *blkend_old;
+ PageHeaderData pageheader;
+
+ /* Save the page header data */
+ memcpy(&pageheader, buffer, SizeOfPageHeaderData);
+
+ /*
+ * These old_* variables point into the old visibility map page.
+ * cur_old points to the current position on the old page. blkend_old
+ * points to the end of the old block. break_old points to the old page
+ * break position for rewriting a new page. After a new page is
+ * written, break_old advances by rewriteVmBytesPerPage bytes.
+ */
+ cur_old = buffer + SizeOfPageHeaderData;
+ blkend_old = buffer + bytesRead;
+ break_old = cur_old + rewriteVmBytesPerPage;
+
+ while (blkend_old >= break_old)
+ {
+ char vmbuf[BLCKSZ];
+ char *cur_new = vmbuf;
+
+ /* Copy page header in advance */
+ memcpy(vmbuf, &pageheader, SizeOfPageHeaderData);
+
+ cur_new += SizeOfPageHeaderData;
+
+ /*
+ * Process old page bytes one by one, and turn it
+ * into new page.
+ */
+ while (break_old > cur_old)
+ {
+ uint16 new_vmbits = 0;
+ int i;
+
+ /* Generate new format bits while keeping old information */
+ for (i = 0; i < BITS_PER_BYTE; i++)
+ {
+ if ((((uint8) *cur_old) & (1 << i)) != 0)
+ new_vmbits |= 1 << (BITS_PER_HEAPBLOCK * i);
+ }
+
+ /* Copy new visibility map bit to new format page */
+ memcpy(cur_new, &new_vmbits, BITS_PER_HEAPBLOCK);
+
+ cur_old += BITS_PER_HEAPBLOCK_OLD;
+ cur_new += BITS_PER_HEAPBLOCK;
+ }
+
+ /* Set new checksum for a visibility map page, if enabled */
+ if (old_cluster.controldata.data_checksum_version != 0 &&
+ new_cluster.controldata.data_checksum_version != 0)
+ ((PageHeader) vmbuf)->pd_checksum = pg_checksum_page(vmbuf, blkno);
+
+ if (write(dst_fd, vmbuf, BLCKSZ) != BLCKSZ)
+ {
+ if (errno == 0)
+ errno = ENOSPC;
+ goto err;
+ }
+
+ break_old += rewriteVmBytesPerPage;
+ blkno++;
+ }
+ }
+
+err:
+ if (src_fd != 0)
+ close(src_fd);
+
+ if (dst_fd != 0)
+ close(dst_fd);
+
+ return (errno == 0) ? NULL : getErrorText();
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 6122878..876633b 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -110,6 +110,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map was changed by this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201602181
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -365,6 +369,8 @@ bool pid_lock_file_exists(const char *datadir);
const char *copyFile(const char *src, const char *dst, bool force);
const char *linkFile(const char *src, const char *dst);
+const char *rewriteVisibilityMap(const char *fromfile, const char *tofile,
+ bool force);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index b20f073..9daef0b 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -16,7 +16,7 @@
static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
-static void transfer_relfile(FileNameMap *map, const char *suffix);
+static void transfer_relfile(FileNameMap *map, const char *suffix, bool vm_must_add_frozenbit);
/*
@@ -132,6 +132,7 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_must_add_frozenbit = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -141,13 +142,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_must_add_frozenbit = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(&maps[mapnum], "");
+ transfer_relfile(&maps[mapnum], "", vm_must_add_frozenbit);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -155,9 +163,9 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(&maps[mapnum], "_fsm");
+ transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit);
if (vm_crashsafe_match)
- transfer_relfile(&maps[mapnum], "_vm");
+ transfer_relfile(&maps[mapnum], "_vm", vm_must_add_frozenbit);
}
}
}
@@ -168,9 +176,11 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
* transfer_relfile()
*
* Copy or link file from old cluster to new one.
+ * If vm_must_add_frozenbit is true, visibility map pages are rewritten to add the
+ * frozen bit, even in link mode.
*/
static void
-transfer_relfile(FileNameMap *map, const char *type_suffix)
+transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -232,7 +242,13 @@ transfer_relfile(FileNameMap *map, const char *type_suffix)
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyFile(old_file, new_file, true)) != NULL)
+ /* Rewrite visibility map if needed */
+ if (vm_must_add_frozenbit && (strcmp(type_suffix, "_vm") == 0))
+ msg = rewriteVisibilityMap(old_file, new_file, true);
+ else
+ msg = copyFile(old_file, new_file, true);
+
+ if (msg)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -240,7 +256,13 @@ transfer_relfile(FileNameMap *map, const char *type_suffix)
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkFile(old_file, new_file)) != NULL)
+ /* Rewrite visibility map if needed */
+ if (vm_must_add_frozenbit && (strcmp(type_suffix, "_vm") == 0))
+ msg = rewriteVisibilityMap(old_file, new_file, true);
+ else
+ msg = linkFile(old_file, new_file);
+
+ if (msg)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index ba79fb3..8f037d5 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -174,6 +174,12 @@ if "$MAKE" -C "$oldsrc" installcheck; then
mv "$temp_root"/dump1.sql "$temp_root"/dump1.sql.orig
sed "s;$oldsrc;$newsrc;g" "$temp_root"/dump1.sql.orig >"$temp_root"/dump1.sql
fi
+
+ # After vacuuming all relations, dump the relallvisible values of all relations in the pre-upgrade
+ # cluster and save them to vm_test1.txt for the visibility map rewriting regression test.
+ vm_sql="SELECT c.relname, c.relallvisible FROM pg_class as c, pg_namespace as n WHERE c.relnamespace = n.oid AND n.nspname NOT IN ('information_schema', 'pg_toast', 'pg_catalog') ORDER BY c.relname;"
+ vacuumdb -d regression || visibilitymap_vacuum1_status=$?
+ psql -d regression -c "$vm_sql" > "$temp_root"/vm_test1.txt || visibilitymap_test1_status=$?
else
make_installcheck_status=$?
fi
@@ -188,6 +194,14 @@ if [ -n "$pg_dumpall1_status" ]; then
echo "pg_dumpall of pre-upgrade database cluster failed"
exit 1
fi
+if [ -n "$visibilitymap_vacuum1_status" ];then
+ echo "VACUUM of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+if [ -n "$visibilitymap_test1_status" ];then
+ echo "SELECT of pre-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
PGDATA=$BASE_PGDATA
@@ -203,6 +217,10 @@ case $testhost in
esac
pg_dumpall -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+# After analyzing (not vacuuming) all relations, dump the relallvisible values of all relations in the
+# post-upgrade cluster to vm_test2.txt
+psql -d regression -c "$vm_sql" > "$temp_root"/vm_test2.txt || visibilitymap_test2_status=$?
+
pg_ctl -m fast stop
# no need to echo commands anymore
@@ -214,11 +232,28 @@ if [ -n "$pg_dumpall2_status" ]; then
exit 1
fi
+if [ -n "$visibilitymap_vacuum2_status" ];then
+ echo "VACUUM of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
+if [ -n "$visibilitymap_test2_status" ];then
+ echo "SELECT of post-upgrade database cluster for visibility map test failed"
+ exit 1
+fi
+
case $testhost in
MINGW*) cmd /c delete_old_cluster.bat ;;
*) sh ./delete_old_cluster.sh ;;
esac
+# Compare the relallvisible values of all relations between the pre-upgrade cluster and
+# the post-upgrade cluster.
+if ! diff "$temp_root"/vm_test1.txt "$temp_root"/vm_test2.txt >/dev/null; then
+ echo "Visibility map rewriting test failed"
+ exit 1
+fi
+
if diff "$temp_root"/dump1.sql "$temp_root"/dump2.sql >/dev/null; then
echo PASSED
exit 0
On Wed, Mar 9, 2016 at 9:09 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
* 001 patch : Incorporated the documentation suggestions and updated
logic a little.
This 001 patch looks so little like what I was expecting that I
decided to start over from scratch. The new version I wrote is
attached here. I don't understand why your version tinkers with the
logic for setting the all-frozen bit; I thought that what I already
committed dealt with that already, and in any case, your version
doesn't even compile against latest sources. Your version also leaves
the scan_all terminology intact even though it's not accurate any
more, and I am not very convinced that the updates to the
page-skipping logic are actually correct. Please have a look over
this version and see what you think.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
optimize-scanall-vacuum.patch (application/x-patch)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a09ceb2..2f72633 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5984,12 +5984,15 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs an aggressive scan if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
- the age specified by this setting. The default is 150 million
- transactions. Although users can set this value anywhere from zero to
- two billions, <command>VACUUM</> will silently limit the effective value
- to 95% of <xref linkend="guc-autovacuum-freeze-max-age">, so that a
+ the age specified by this setting. An aggressive scan differs from
+ a regular <command>VACUUM</> in that it visits every page that might
+ contain unfrozen XIDs or MXIDs, not just those that might contain dead
+ tuples. The default is 150 million transactions. Although users can
+ set this value anywhere from zero to two billions, <command>VACUUM</>
+ will silently limit the effective value to 95% of
+ <xref linkend="guc-autovacuum-freeze-max-age">, so that a
periodical manual <command>VACUUM</> has a chance to run before an
anti-wraparound autovacuum is launched for the table. For more
information see
@@ -6028,9 +6031,12 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs an aggressive scan if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
- the age specified by this setting. The default is 150 million multixacts.
+ the age specified by this setting. An aggressive scan differs from
+ a regular <command>VACUUM</> in that it visits every page that might
+ contain unfrozen XIDs or MXIDs, not just those that might contain dead
+ tuples. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
<command>VACUUM</> will silently limit the effective value to 95% of
<xref linkend="guc-autovacuum-multixact-freeze-max-age">, so that a
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..d742ec9 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -438,22 +438,27 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
- <xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> uses the <link linkend="storage-vm">visibility map</>
+ to determine which pages of a relation must be scanned. Normally, it
+ will skip pages that don't have any dead row versions even if those pages
+ might still have row versions with old XID values. Therefore, normal
+ scans won't succeed in freezing every row version in the table.
+ Periodically, <command>VACUUM</> will perform an <firstterm>aggressive
+ vacuum</>, skipping only those pages which contain neither dead rows nor
+ any unfrozen XID or MXID values.
+ <xref linkend="guc-vacuum-freeze-table-age">
+ controls when <command>VACUUM</> does that: all-visible but not all-frozen
+ pages are scanned if the number of transactions that have passed since the
+ last such scan is greater than <varname>vacuum_freeze_table_age</> minus
+ <varname>vacuum_freeze_min_age</>. Setting
+ <varname>vacuum_freeze_table_age</> to 0 forces <command>VACUUM</> to
+ use this more aggressive strategy for all scans.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
+ the time of the last aggressive vacuum. If it were to go
unvacuumed for longer than
that, data loss could result. To ensure that this does not happen,
autovacuum is invoked on any table that might contain unfrozen rows with
@@ -491,7 +496,7 @@
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ frequent aggressive vacuuming.
</para>
<para>
@@ -527,7 +532,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last aggressive <command>VACUUM</> for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -554,18 +559,21 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<para>
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
- <structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
+ <structfield>relfrozenxid</> can only be advanced when every page of the
+ table that might contain unfrozen XIDs is scanned. This happens when
+ <structfield>relfrozenxid</> is more than
+ <varname>vacuum_freeze_table_age</> transactions old, when
+ <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all
+ pages that are not already all-frozen happen to
require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
+ scans every page in the table that is not already all-frozen, it should
+ set <literal>age(relfrozenxid)</> to a value just a little more than the
+ <varname>vacuum_freeze_min_age</> setting
that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ <command>VACUUM</> started). If no <structfield>relfrozenxid</>-advancing
+ <command>VACUUM</> is issued on the table until
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -634,21 +642,23 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
</para>
<para>
- During a <command>VACUUM</> table scan, either partial or of the whole
- table, any multixact ID older than
+ Whenever <command>VACUUM</> scans any part of a table, it will replace
+ any multixact ID it encounters which is older than
<xref linkend="guc-vacuum-multixact-freeze-min-age">
- is replaced by a different value, which can be the zero value, a single
+ by a different value, which can be the zero value, a single
transaction ID, or a newer multixact ID. For each table,
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
- scan is forced. <function>mxid_age()</> can be used on
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, an aggressive
+ vacuum is forced. As discussed in the previous section, an aggressive
+ vacuum means that only those pages which are known to be all-frozen will
+ be skipped. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
+ Aggressive <command>VACUUM</> scans, regardless of
what causes them, enable advancing the value for that table.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
@@ -656,13 +666,13 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
+ As a safety device, an aggressive vacuum scan will occur for any table
whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Aggressive
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
+ Both of these kinds of aggressive scans will occur even if autovacuum is
nominally disabled.
</para>
</sect3>
@@ -743,9 +753,9 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
- <structfield>relfrozenxid</>, otherwise only pages that have been modified
+ than <varname>vacuum_freeze_table_age</> transactions old, an aggressive
+ vacuum is performed to freeze old tuples and advance
+ <structfield>relfrozenxid</>; otherwise, only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 61d2edd..a8fd4ac 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,7 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber frozenskipped_pages; /* # of frozen pages we skipped */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -136,7 +137,7 @@ static BufferAccessStrategy vac_strategy;
/* non-export function prototypes */
static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
- Relation *Irel, int nindexes, bool scan_all);
+ Relation *Irel, int nindexes, bool aggressive);
static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
static bool lazy_check_needs_freeze(Buffer buf, bool *hastup);
static void lazy_vacuum_index(Relation indrel,
@@ -182,8 +183,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
int usecs;
double read_rate,
write_rate;
- bool scan_all; /* should we scan all pages? */
- bool scanned_all; /* did we actually scan all pages? */
+ bool aggressive; /* should we scan all unfrozen pages? */
+ bool scanned_all_unfrozen; /* actually scanned all such pages? */
TransactionId xidFullScanLimit;
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
@@ -221,15 +222,15 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
&MultiXactCutoff, &mxactFullScanLimit);
/*
- * We request a full scan if either the table's frozen Xid is now older
- * than or equal to the requested Xid full-table scan limit; or if the
- * table's minimum MultiXactId is older than or equal to the requested
+ * We request an aggressive scan if either the table's frozen Xid is now
+ * older than or equal to the requested Xid full-table scan limit; or if
+ * the table's minimum MultiXactId is older than or equal to the requested
* mxid full-table scan limit.
*/
- scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
- xidFullScanLimit);
- scan_all |= MultiXactIdPrecedesOrEquals(onerel->rd_rel->relminmxid,
- mxactFullScanLimit);
+ aggressive = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
+ xidFullScanLimit);
+ aggressive |= MultiXactIdPrecedesOrEquals(onerel->rd_rel->relminmxid,
+ mxactFullScanLimit);
vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
@@ -244,25 +245,30 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
vacrelstats->hasindex = (nindexes > 0);
/* Do the vacuuming */
- lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, scan_all);
+ lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, aggressive);
/* Done with indexes */
vac_close_indexes(nindexes, Irel, NoLock);
/*
- * Compute whether we actually scanned the whole relation. If we did, we
- * can adjust relfrozenxid and relminmxid.
+ * Compute whether we actually scanned every unfrozen page in the
+ * relation. If we did, we can adjust relfrozenxid and relminmxid.
*
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if (vacrelstats->scanned_pages + vacrelstats->frozenskipped_pages
+ < vacrelstats->rel_pages)
{
- Assert(!scan_all);
- scanned_all = false;
+ if (aggressive)
+ elog(FATAL, "scanned %u frozenskipped %u total %u",
+ vacrelstats->scanned_pages, vacrelstats->frozenskipped_pages,
+ vacrelstats->rel_pages);
+ Assert(!aggressive);
+ scanned_all_unfrozen = false;
}
else
- scanned_all = true;
+ scanned_all_unfrozen = true;
/*
* Optionally truncate the relation.
@@ -302,8 +308,9 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
- new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
- new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
+ new_frozen_xid = scanned_all_unfrozen ? FreezeLimit : InvalidTransactionId;
+ new_min_multi = scanned_all_unfrozen ? MultiXactCutoff
+ : InvalidMultiXactId;
vac_update_relstats(onerel,
new_rel_pages,
@@ -434,7 +441,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
*/
static void
lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
- Relation *Irel, int nindexes, bool scan_all)
+ Relation *Irel, int nindexes, bool aggressive)
{
BlockNumber nblocks,
blkno;
@@ -450,8 +457,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
int i;
PGRUsage ru0;
Buffer vmbuffer = InvalidBuffer;
- BlockNumber next_not_all_visible_block;
- bool skipping_all_visible_blocks;
+ BlockNumber next_unskippable_block;
+ bool skipping_blocks;
xl_heap_freeze_tuple *frozen;
StringInfoData buf;
@@ -479,35 +486,39 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
/*
- * We want to skip pages that don't require vacuuming according to the
- * visibility map, but only when we can skip at least SKIP_PAGES_THRESHOLD
- * consecutive pages. Since we're reading sequentially, the OS should be
- * doing readahead for us, so there's no gain in skipping a page now and
- * then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Except when aggressive is set, we want to skip pages that are
+ * all-visible according to the visibility map, but only when we can skip
+ * at least SKIP_PAGES_THRESHOLD consecutive pages. Since we're reading
+ * sequentially, the OS should be doing readahead for us, so there's no
+ * gain in skipping a page now and then; that's likely to disable
+ * readahead and so be counterproductive. Also, skipping even a single
+ * page means that we can't update relfrozenxid, so we only want to do it
+ * if we can skip a goodly number of pages.
*
- * Before entering the main loop, establish the invariant that
- * next_not_all_visible_block is the next block number >= blkno that's not
- * all-visible according to the visibility map, or nblocks if there's no
- * such block. Also, we set up the skipping_all_visible_blocks flag,
- * which is needed because we need hysteresis in the decision: once we've
- * started skipping blocks, we may as well skip everything up to the next
- * not-all-visible block.
+ * When aggressive is set, we can't skip pages just because they are
+ * all-visible, but we can still skip pages that are all-frozen, since
+ * such pages do not need freezing and do not affect the value that we can
+ * safely set for relfrozenxid or relminmxid.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
- * maintain next_not_all_visible_block anyway, so as to set up the
- * all_visible_according_to_vm flag correctly for each page.
+ * Before entering the main loop, establish the invariant that
+ * next_unskippable_block is the next block number >= blkno that we
+ * can't skip based on the visibility map, either all-visible for a
+ * regular scan or all-frozen for an aggressive scan. We set it to
+ * nblocks if there's no such block. We also set up the skipping_blocks
+ * flag correctly at this stage.
*
* Note: The value returned by visibilitymap_get_status could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible or all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. It's easy to see that
+ * skipping a page when aggressive is not set is not a very big deal; we
+ * might leave some dead tuples lying around, but the next vacuum will
+ * find them. But even when aggressive *is* set, it's still OK if we miss
+ * a page whose all-frozen marking has just been cleared. Any new XIDs
+ * just added to that page are necessarily newer than the GlobalXmin we
+ * computed, so they'll have no effect on the value to which we can safely
+ * set relfrozenxid. A similar argument applies for MXIDs and relminmxid.
*
* We will scan the table's last page, at least to the extent of
* determining whether it has tuples or not, even if it should be skipped
@@ -518,18 +529,31 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* the last page. This is worth avoiding mainly because such a lock must
* be replayed on any hot standby, where it can be disruptive.
*/
- for (next_not_all_visible_block = 0;
- next_not_all_visible_block < nblocks;
- next_not_all_visible_block++)
+ for (next_unskippable_block = 0;
+ next_unskippable_block < nblocks;
+ next_unskippable_block++)
{
- if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
- break;
+ uint8 vmstatus;
+
+ vmstatus = visibilitymap_get_status(onerel, next_unskippable_block,
+ &vmbuffer);
+ if (aggressive)
+ {
+ if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
+ break;
+ }
+ else
+ {
+ if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
+ break;
+ }
vacuum_delay_point();
}
- if (next_not_all_visible_block >= SKIP_PAGES_THRESHOLD)
- skipping_all_visible_blocks = true;
+
+ if (next_unskippable_block >= SKIP_PAGES_THRESHOLD)
+ skipping_blocks = true;
else
- skipping_all_visible_blocks = false;
+ skipping_blocks = false;
for (blkno = 0; blkno < nblocks; blkno++)
{
@@ -552,15 +576,28 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
#define FORCE_CHECK_PAGE() \
(blkno == nblocks - 1 && should_attempt_truncation(vacrelstats))
- if (blkno == next_not_all_visible_block)
+ if (blkno == next_unskippable_block)
{
- /* Time to advance next_not_all_visible_block */
- for (next_not_all_visible_block++;
- next_not_all_visible_block < nblocks;
- next_not_all_visible_block++)
+ /* Time to advance next_unskippable_block */
+ for (next_unskippable_block++;
+ next_unskippable_block < nblocks;
+ next_unskippable_block++)
{
- if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
- break;
+ uint8 vmskipflags;
+
+ vmskipflags = visibilitymap_get_status(onerel,
+ next_unskippable_block,
+ &vmbuffer);
+ if (aggressive)
+ {
+ if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0)
+ break;
+ }
+ else
+ {
+ if ((vmskipflags & VISIBILITYMAP_ALL_VISIBLE) == 0)
+ break;
+ }
vacuum_delay_point();
}
@@ -569,17 +606,45 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* skipping_all_visible_blocks to do the right thing at the
* following blocks.
*/
- if (next_not_all_visible_block - blkno > SKIP_PAGES_THRESHOLD)
- skipping_all_visible_blocks = true;
+ if (next_unskippable_block - blkno > SKIP_PAGES_THRESHOLD)
+ skipping_blocks = true;
else
- skipping_all_visible_blocks = false;
+ skipping_blocks = false;
+
+ /*
+ * Normally, the fact that we can't skip this block must mean that
+ * it's not all-visible. But in an aggressive vacuum we know only
+ * that it's not all-frozen, so it might still be all-visible.
+ */
all_visible_according_to_vm = false;
+ if (aggressive && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
+ all_visible_according_to_vm = true;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all && !FORCE_CHECK_PAGE())
+ /*
+ * The current block is potentially skippable; if we've seen a
+ * long enough run of skippable blocks to justify skipping it,
+ * and we're not forced to check it, then go ahead and skip.
+ * Otherwise, the page must be at least all-visible if not
+ * all-frozen, so we can set all_visible_according_to_vm = true.
+ */
+ if (skipping_blocks && !FORCE_CHECK_PAGE())
+ {
+ /*
+ * Tricky, tricky. If this is in aggressive vacuum, the page
+ * must have been all-frozen at the time we checked whether it
+ * was skippable, but it might not be any more. We must be
+ * careful to count it as a skipped all-frozen page in that
+ * case, or else we'll think we can't update relfrozenxid and
+ * relminmxid. If it's not an aggressive vacuum, we don't know
+ * whether it was all-frozen, so we have to recheck; but in
+ * this case an approximate answer is OK.
+ */
+ if (aggressive || VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
+ vacrelstats->frozenskipped_pages++;
continue;
+ }
all_visible_according_to_vm = true;
}
@@ -628,9 +693,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* Pin the visibility map page in case we need to mark the page
* all-visible. In most cases this will be very cheap, because we'll
* already have the correct page pinned anyway. However, it's
- * possible that (a) next_not_all_visible_block is covered by a
- * different VM page than the current block or (b) we released our pin
- * and did a cycle of index vacuuming.
+ * possible that (a) next_unskippable_block is covered by a different
+ * VM page than the current block or (b) we released our pin and did a
+ * cycle of index vacuuming.
*/
visibilitymap_pin(onerel, blkno, &vmbuffer);
@@ -641,12 +706,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
if (!ConditionalLockBufferForCleanup(buf))
{
/*
- * If we're not scanning the whole relation to guard against XID
+ * If we're not performing an aggressive scan to guard against XID
* wraparound, and we don't want to forcibly check the page, then
* it's OK to skip vacuuming pages we get a lock conflict on. They
* will be dealt with in some future vacuum.
*/
- if (!scan_all && !FORCE_CHECK_PAGE())
+ if (!aggressive && !FORCE_CHECK_PAGE())
{
ReleaseBuffer(buf);
vacrelstats->pinskipped_pages++;
@@ -663,7 +728,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* ourselves for multiple buffers and then service whichever one
* is received first. For now, this seems good enough.
*
- * If we get here with scan_all false, then we're just forcibly
+ * If we get here with aggressive false, then we're just forcibly
* checking the page, and so we don't want to insist on getting
* the lock; we only need to know if the page contains tuples, so
* that we can update nonempty_pages correctly. It's convenient
@@ -679,7 +744,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
vacrelstats->nonempty_pages = blkno + 1;
continue;
}
- if (!scan_all)
+ if (!aggressive)
{
/*
* Here, we must not advance scanned_pages; that would amount
On Thu, Mar 10, 2016 at 3:27 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thank you for reviewing!
Attached updated patch.
On Thu, Mar 10, 2016 at 3:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Mar 9, 2016 at 9:09 AM, Masahiko Sawada
<sawada.mshk@gmail.com> wrote:
Attached latest 2 patches.
* 000 patch : Incorporated the review comments and made the rewriting
logic clearer.
That's better, thanks. But your comments don't survive pgindent.
After running pgindent, I get this:
+ /*
+  * These old_* variables point to old visibility map page.
+  *
+  * cur_old : Points to current position on old page. blkend_old :
+  * Points to end of old block. break_old : Points to old page break
+  * position for rewriting a new page. After wrote a new page, old_end
+  * proceeds rewriteVmBytesPerPgae bytes.
+  */
You need to either surround this sort of thing with dashes to make
pgindent ignore it, or, probably better, rewrite it using complete
sentences that together form a paragraph.
Fixed.
+ Oid pg_database_oid; /* OID of pg_database relation */
Not used anywhere?
Fixed.
Instead of vm_need_rewrite, how about vm_must_add_frozenbit?
Fixed.
Can you explain the changes to test.sh?
The current regression test scenario is:
1. Do 'make check' on the pre-upgrade cluster
2. Dump relallvisible values of all relations in the pre-upgrade cluster to
vm_test1.txt
3. Do pg_upgrade
4. Do analyze (not vacuum), dump relallvisible values of all relations
in the post-upgrade cluster to vm_test2.txt
5. Compare vm_test1.txt and vm_test2.txt
That is, the regression test compares the relallvisible values in the
pre-upgrade cluster and the post-upgrade cluster.
But because test.sh always uses pre/post clusters with the same catalog
version, I realized that we cannot ensure that visibility map
rewriting is processed successfully in the test.sh framework;
the visibility map rewriting would never be executed.
We might need another framework for testing visibility map page rewriting.
After some further thought, I think it's better to add check logic for
the result of rewriting the visibility map to the upgrade logic rather
than to the regression test, in order to ensure that rewriting the
visibility map has been done successfully.
As a draft, the attached patch checks the result of rewriting the
visibility map after each relation is rewritten, as a routine of pg_upgrade.
The disadvantage of this is that we need to scan each visibility map page
twice.
But since the visibility map would not be so large, that should not be bad.
Thoughts?
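For illustration, the per-byte check described above boils down to something
like the following standalone sketch: each old-format byte (one all-visible
bit per heap page) should expand into a 16-bit new-format chunk that carries
the all-visible bits over and leaves every all-frozen bit clear, since the
conversion never sets it. The constants and the helper name here are
illustrative only, not taken from the patch.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative constants; the real definitions live in the server headers. */
#define BITS_PER_BYTE          8
#define BITS_PER_HEAPBLOCK_OLD 1    /* old format: all-visible bit only */
#define BITS_PER_HEAPBLOCK     2    /* new format: all-visible + all-frozen */

/*
 * Verify that one old-format byte (covering 8 heap pages) was expanded
 * correctly into one new-format 16-bit chunk: every all-visible bit must
 * be carried over, and every all-frozen bit must still be clear.
 */
static bool
check_rewritten_chunk(uint8_t old_byte, uint16_t new_chunk)
{
    int     i;

    for (i = 0; i < BITS_PER_BYTE; i++)
    {
        bool    old_vis = (old_byte & (1 << (BITS_PER_HEAPBLOCK_OLD * i))) != 0;
        bool    new_vis = (new_chunk & (1 << (BITS_PER_HEAPBLOCK * i))) != 0;
        bool    new_frozen = (new_chunk & (2 << (BITS_PER_HEAPBLOCK * i))) != 0;

        if (new_frozen || old_vis != new_vis)
            return false;
    }
    return true;
}

int
main(void)
{
    /* 0xA5 (heap pages 0, 2, 5 and 7 all-visible) should expand to 0x4411. */
    printf("%s\n", check_rewritten_chunk(0xA5, 0x4411) ? "ok" : "mismatch");
    return 0;
}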
Regards,
--
Regards,
--
Masahiko Sawada
Attachments:
000_pgupgrade_rewrite_vm_v40.patch (application/x-patch)
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 2a99a28..6fd1460 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -9,10 +9,15 @@
#include "postgres_fe.h"
+#include "access/visibilitymap.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
#include <fcntl.h>
+#define BITS_PER_HEAPBLOCK_OLD 1
#ifndef WIN32
@@ -21,6 +26,7 @@ static int copy_file(const char *fromfile, const char *tofile, bool force);
static int win32_pghardlink(const char *src, const char *dst);
#endif
+static bool checkRewriteVisibilityMap(const char *oldfile, const char *newfile);
/*
* copyFile()
@@ -138,6 +144,235 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilityMap()
+ *
+ * In versions of PostgreSQL prior to catversion 201603011, PostgreSQL's
+ * visibility map included one bit per heap page; it now includes two.
+ * When upgrading a cluster from before that time to a current PostgreSQL
+ * version, we could refuse to copy visibility maps from the old cluster
+ * to the new cluster; the next VACUUM would recreate them, but at the
+ * price of scanning the entire table. So, instead, we rewrite the old
+ * visibility maps in the new format. That way, the all-visible bit
+ * remains set for the pages for which it was set previously. The
+ * all-frozen bit is never set by this conversion; we leave that to
+ * VACUUM.
+ */
+const char *
+rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
+{
+ int src_fd = 0;
+ int dst_fd = 0;
+ char buffer[BLCKSZ];
+ ssize_t bytesRead;
+ int rewriteVmBytesPerPage;
+ BlockNumber blkno = 0;
+
+ /* Compute how many old page bytes we need to rewrite one new page */
+ rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return "Invalid old file or new file";
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ return getErrorText();
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ {
+ close(src_fd);
+ return getErrorText();
+ }
+
+ /*
+ * Turn each visibility map page into 2 pages one by one.
+ * Each of the rewritten pages has the same page header as the old page.
+ */
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
+ char *old_cur, *old_break, *old_blkend;
+ PageHeaderData pageheader;
+
+ /* Save the page header data */
+ memcpy(&pageheader, buffer, SizeOfPageHeaderData);
+
+ /*
+ * These old_* variables point to old visibility map page.
+ * old_cur points to the current position on the old page. old_blkend
+ * points to the end of the old block. old_break points to the old page
+ * break position for rewriting a new page. After writing a new
+ * page, old_break advances by rewriteVmBytesPerPage bytes.
+ */
+ old_cur = buffer + SizeOfPageHeaderData;
+ old_blkend = buffer + bytesRead;
+ old_break = old_cur + rewriteVmBytesPerPage;
+
+ while (old_blkend >= old_break)
+ {
+ char vmbuf[BLCKSZ];
+ char *new_cur = vmbuf;
+
+ /* Copy page header in advance */
+ memcpy(vmbuf, &pageheader, SizeOfPageHeaderData);
+
+ new_cur += SizeOfPageHeaderData;
+
+ /*
+ * Process old page bytes one by one, and turn them
+ * into a new page.
+ */
+ while (old_break > old_cur)
+ {
+ uint16 new_vmbits = 0;
+ int i;
+
+ /* Generate new format bits while keeping old information */
+ for (i = 0; i < BITS_PER_BYTE; i++)
+ {
+ if ((((uint8) *old_cur) & (1 << (BITS_PER_HEAPBLOCK_OLD * i))))
+ new_vmbits |= 1 << (BITS_PER_HEAPBLOCK * i);
+ }
+
+ /* Copy new visibility map bit to new format page */
+ memcpy(new_cur, &new_vmbits, BITS_PER_HEAPBLOCK);
+
+ old_cur += BITS_PER_HEAPBLOCK_OLD;
+ new_cur += BITS_PER_HEAPBLOCK;
+ }
+
+ /* Set new checksum for a visibility map page, if enabled */
+ if (old_cluster.controldata.data_checksum_version != 0 &&
+ new_cluster.controldata.data_checksum_version != 0)
+ ((PageHeader) vmbuf)->pd_checksum = pg_checksum_page(vmbuf, blkno);
+
+ if (write(dst_fd, vmbuf, BLCKSZ) != BLCKSZ)
+ {
+ close(dst_fd);
+ close(src_fd);
+ return getErrorText();
+ }
+
+ old_break += rewriteVmBytesPerPage;
+ blkno++;
+ }
+ }
+
+ /* Close files */
+ close(dst_fd);
+ close(src_fd);
+
+ /*
+ * After rewriting the visibility map, we must check the visibility map bits of both files.
+ */
+ if(!checkRewriteVisibilityMap(fromfile, tofile))
+ return "failed to rewrite visibility map";
+
+ return NULL;
+
+}
+
+/*
+ * checkRewriteVisibilityMap()
+ *
+ * To ensure that rewriting the visibility map has been done successfully,
+ * this function compares the visibility map bits between oldfile and newfile.
+ */
+bool
+checkRewriteVisibilityMap(const char *oldfile, const char *newfile)
+{
+ int old_fd = 0;
+ int new_fd = 0;
+ char old_buffer[BLCKSZ];
+ char new_buffer[BLCKSZ];
+ bool ret = true;
+
+ if ((old_fd = open(oldfile, O_RDONLY, 0)) < 0)
+ return false;
+
+ if ((new_fd = open(newfile, O_RDONLY, 0)) < 0)
+ {
+ close(old_fd);
+ return false;
+ }
+
+ /*
+ * Since the new visibility map format is larger than the old one,
+ * we read an old-format page first.
+ */
+ while ((read(old_fd, old_buffer, BLCKSZ)) == BLCKSZ)
+ {
+ int i;
+ char *old_cur;
+
+ /* Skip page header area */
+ old_cur = old_buffer + SizeOfPageHeaderData;
+
+ /*
+ * The new-format visibility map is (BITS_PER_HEAPBLOCK /
+ * BITS_PER_HEAPBLOCK_OLD) times larger than the old format.
+ */
+ for (i = 0; i < (BITS_PER_HEAPBLOCK / BITS_PER_HEAPBLOCK_OLD); i++)
+ {
+ int j;
+ char *new_cur;
+
+ if ((read(new_fd, new_buffer, BLCKSZ)) != BLCKSZ)
+ {
+ ret = false;
+ goto err;
+ }
+
+ /* Skip page header area */
+ new_cur = new_buffer + SizeOfPageHeaderData;
+
+ while ((new_buffer + BLCKSZ) > new_cur)
+ {
+ uint8 old_visiblebits = 0;
+ uint8 new_visiblebits = 0;
+ uint16 new_vmbits = *(uint16 *) new_cur;
+
+ /*
+ * Transform new format bits to old format bits while
+ * checking all-frozen bit is not set.
+ */
+ for (j = 0; j < (BITS_PER_BYTE * BITS_PER_HEAPBLOCK); j++)
+ {
+ /* all-frozen bit (second bit) must not be set on new format */
+ if (new_vmbits & (2 << (BITS_PER_HEAPBLOCK * j)))
+ {
+ ret = false;
+ goto err;
+ }
+
+ if (new_vmbits & (1 << (BITS_PER_HEAPBLOCK * j)))
+ new_visiblebits |= 1 << (BITS_PER_HEAPBLOCK_OLD * j);
+ }
+
+ old_visiblebits = (uint8) *old_cur;
+
+ /*
+ * Compare the visible bits from the old page and the new page.
+ * The all-visible bits (first bit) must be the same.
+ */
+ if (old_visiblebits != new_visiblebits)
+ {
+ ret = false;
+ goto err;
+ }
+
+ old_cur += BITS_PER_HEAPBLOCK_OLD;
+ new_cur += BITS_PER_HEAPBLOCK;
+ }
+ }
+ }
+
+err:
+ close(new_fd);
+ close(old_fd);
+
+ return ret;
+
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 6122878..89beb73 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -110,6 +110,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map was changed by this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201603011
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -365,6 +369,8 @@ bool pid_lock_file_exists(const char *datadir);
const char *copyFile(const char *src, const char *dst, bool force);
const char *linkFile(const char *src, const char *dst);
+const char *rewriteVisibilityMap(const char *fromfile, const char *tofile,
+ bool force);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index b20f073..9daef0b 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -16,7 +16,7 @@
static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
-static void transfer_relfile(FileNameMap *map, const char *suffix);
+static void transfer_relfile(FileNameMap *map, const char *suffix, bool vm_must_add_frozenbit);
/*
@@ -132,6 +132,7 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_must_add_frozenbit = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -141,13 +142,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_must_add_frozenbit = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(&maps[mapnum], "");
+ transfer_relfile(&maps[mapnum], "", vm_must_add_frozenbit);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -155,9 +163,9 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(&maps[mapnum], "_fsm");
+ transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit);
if (vm_crashsafe_match)
- transfer_relfile(&maps[mapnum], "_vm");
+ transfer_relfile(&maps[mapnum], "_vm", vm_must_add_frozenbit);
}
}
}
@@ -168,9 +176,11 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
* transfer_relfile()
*
* Copy or link file from old cluster to new one.
+ * If vm_must_add_frozenbit is true, each visibility map page is rewritten
+ * with the frozen bit added, even in link mode.
*/
static void
-transfer_relfile(FileNameMap *map, const char *type_suffix)
+transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -232,7 +242,13 @@ transfer_relfile(FileNameMap *map, const char *type_suffix)
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyFile(old_file, new_file, true)) != NULL)
+ /* Rewrite visibility map if needed */
+ if (vm_must_add_frozenbit && (strcmp(type_suffix, "_vm") == 0))
+ msg = rewriteVisibilityMap(old_file, new_file, true);
+ else
+ msg = copyFile(old_file, new_file, true);
+
+ if (msg)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -240,7 +256,13 @@ transfer_relfile(FileNameMap *map, const char *type_suffix)
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkFile(old_file, new_file)) != NULL)
+ /* Rewrite visibility map if needed */
+ if (vm_must_add_frozenbit && (strcmp(type_suffix, "_vm") == 0))
+ msg = rewriteVisibilityMap(old_file, new_file, true);
+ else
+ msg = linkFile(old_file, new_file);
+
+ if (msg)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
On Thu, Mar 10, 2016 at 8:51 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
After some further thought, I think it's better to add check logic for
the result of rewriting the visibility map to the upgrade logic rather
than to the regression test, in order to ensure that rewriting the
visibility map has been done successfully.
As a draft, the attached patch checks the result of rewriting the
visibility map after each relation is rewritten, as a routine of pg_upgrade.
The disadvantage of this is that we need to scan each visibility map page
twice.
But since the visibility map would not be so large, that should not be bad.
Thoughts?
I think that's kind of pointless. We need to test that this
conversion code works, but once it does, I don't think we should make
everybody pay the overhead of retesting that. Anyway, the test code
could have bugs, too.
Here's an updated version of your patch with that code removed and
some cosmetic cleanups like fixing typos and stuff like that. I think
this is mostly ready to commit, but I noticed one problem: your
conversion code always produces two output pages for each input page
even if one of them would be empty. In particular, if you have a
large number of small relations and run pg_upgrade, all of their
visibility maps will go from 8kB to 16kB. That isn't the end of the
world, maybe, but I think you should see if you can't fix it
somehow....
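One possible direction, sketched below under the assumption that the rewrite
loop keeps the old_cur/old_blkend pointers described in the patch: detect
whether the remaining bytes of the old page are all zero and, if this would
be the final output page, stop instead of emitting an all-zero second page.
A missing trailing visibility map page is read back as "not all-visible",
so nothing is lost. The helper below is hypothetical, not part of the patch.

#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if this part of the old visibility map page has no bits set,
 * in which case the corresponding new-format page would be all zeroes and,
 * if it is the last one, need not be written at all.
 */
static bool
old_chunk_is_all_zero(const char *chunk, size_t len)
{
    size_t  i;

    for (i = 0; i < len; i++)
    {
        if (chunk[i] != 0)
            return false;
    }
    return true;
}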
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
pgupgrade-rewrite-v41.patch (application/x-patch)
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 2a99a28..34e1451 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -9,10 +9,15 @@
#include "postgres_fe.h"
+#include "access/visibilitymap.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
#include <fcntl.h>
+#define BITS_PER_HEAPBLOCK_OLD 1
#ifndef WIN32
@@ -138,6 +143,130 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilityMap()
+ *
+ * In versions of PostgreSQL prior to catversion 201603011, PostgreSQL's
+ * visibility map included one bit per heap page; it now includes two.
+ * When upgrading a cluster from before that time to a current PostgreSQL
+ * version, we could refuse to copy visibility maps from the old cluster
+ * to the new cluster; the next VACUUM would recreate them, but at the
+ * price of scanning the entire table. So, instead, we rewrite the old
+ * visibility maps in the new format. That way, the all-visible bit
+ * remains set for the pages for which it was set previously. The
+ * all-frozen bit is never set by this conversion; we leave that to
+ * VACUUM.
+ */
+const char *
+rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
+{
+ int src_fd = 0;
+ int dst_fd = 0;
+ char buffer[BLCKSZ];
+ ssize_t bytesRead;
+ int rewriteVmBytesPerPage;
+ BlockNumber blkno = 0;
+
+ /* Compute how many old page bytes we need to rewrite one new page */
+ rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return "Invalid old file or new file";
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ return getErrorText();
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ {
+ close(src_fd);
+ return getErrorText();
+ }
+
+ /*
+ * Turn each visibility map page into 2 pages one by one. Each new page
+ * has the same page header as the old one.
+ */
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
+ char *old_cur,
+ *old_break,
+ *old_blkend;
+ PageHeaderData pageheader;
+
+ /* Save the page header data */
+ memcpy(&pageheader, buffer, SizeOfPageHeaderData);
+
+ /*
+ * These old_* variables point to old visibility map page. old_cur
+ * points to current position on old page. old_blkend points to end of
+ * old block. old_break points to old page break position for rewriting
+ * a new page. After writing a new page, old_break advances by
+ * rewriteVmBytesPerPage bytes.
+ */
+ old_cur = buffer + SizeOfPageHeaderData;
+ old_blkend = buffer + bytesRead;
+ old_break = old_cur + rewriteVmBytesPerPage;
+
+ while (old_blkend >= old_break)
+ {
+ char vmbuf[BLCKSZ];
+ char *new_cur = vmbuf;
+
+ /* Copy page header in advance */
+ memcpy(vmbuf, &pageheader, SizeOfPageHeaderData);
+
+ new_cur += SizeOfPageHeaderData;
+
+ /*
+ * Process old page bytes one by one, and turn them into a new page.
+ */
+ while (old_break > old_cur)
+ {
+ uint16 new_vmbits = 0;
+ int i;
+
+ /* Generate new format bits while keeping old information */
+ for (i = 0; i < BITS_PER_BYTE; i++)
+ {
+ uint8 byte = * (uint8 *) old_cur;
+
+ if (((byte & (1 << (BITS_PER_HEAPBLOCK_OLD * i)))) != 0)
+ new_vmbits |= 1 << (BITS_PER_HEAPBLOCK * i);
+ }
+
+ /* Copy new visibility map bit to new format page */
+ memcpy(new_cur, &new_vmbits, BITS_PER_HEAPBLOCK);
+
+ old_cur += BITS_PER_HEAPBLOCK_OLD;
+ new_cur += BITS_PER_HEAPBLOCK;
+ }
+
+ /* Set new checksum for a visibility map page (if enabled) */
+ if (old_cluster.controldata.data_checksum_version != 0 &&
+ new_cluster.controldata.data_checksum_version != 0)
+ ((PageHeader) vmbuf)->pd_checksum =
+ pg_checksum_page(vmbuf, blkno);
+
+ if (write(dst_fd, vmbuf, BLCKSZ) != BLCKSZ)
+ {
+ close(dst_fd);
+ close(src_fd);
+ return getErrorText();
+ }
+
+ old_break += rewriteVmBytesPerPage;
+ blkno++;
+ }
+ }
+
+ /* Close files */
+ close(dst_fd);
+ close(src_fd);
+
+ return NULL;
+
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 6122878..89beb73 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -110,6 +110,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map was changed by this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201603011
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -365,6 +369,8 @@ bool pid_lock_file_exists(const char *datadir);
const char *copyFile(const char *src, const char *dst, bool force);
const char *linkFile(const char *src, const char *dst);
+const char *rewriteVisibilityMap(const char *fromfile, const char *tofile,
+ bool force);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index b20f073..103651a 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -16,7 +16,7 @@
static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
-static void transfer_relfile(FileNameMap *map, const char *suffix);
+static void transfer_relfile(FileNameMap *map, const char *suffix, bool vm_must_add_frozenbit);
/*
@@ -132,6 +132,7 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_must_add_frozenbit = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -141,13 +142,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_must_add_frozenbit = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(&maps[mapnum], "");
+ transfer_relfile(&maps[mapnum], "", vm_must_add_frozenbit);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -155,9 +163,9 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(&maps[mapnum], "_fsm");
+ transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit);
if (vm_crashsafe_match)
- transfer_relfile(&maps[mapnum], "_vm");
+ transfer_relfile(&maps[mapnum], "_vm", vm_must_add_frozenbit);
}
}
}
@@ -167,10 +175,12 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
/*
* transfer_relfile()
*
- * Copy or link file from old cluster to new one.
+ * Copy or link file from old cluster to new one. If vm_must_add_frozenbit
+ * is true, visibility map forks are converted and rewritten, even in link
+ * mode.
*/
static void
-transfer_relfile(FileNameMap *map, const char *type_suffix)
+transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit)
{
const char *msg;
char old_file[MAXPGPATH];
@@ -232,7 +242,13 @@ transfer_relfile(FileNameMap *map, const char *type_suffix)
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyFile(old_file, new_file, true)) != NULL)
+ /* Rewrite visibility map if needed */
+ if (vm_must_add_frozenbit && (strcmp(type_suffix, "_vm") == 0))
+ msg = rewriteVisibilityMap(old_file, new_file, true);
+ else
+ msg = copyFile(old_file, new_file, true);
+
+ if (msg)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -240,7 +256,13 @@ transfer_relfile(FileNameMap *map, const char *type_suffix)
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkFile(old_file, new_file)) != NULL)
+ /* Rewrite visibility map if needed */
+ if (vm_must_add_frozenbit && (strcmp(type_suffix, "_vm") == 0))
+ msg = rewriteVisibilityMap(old_file, new_file, true);
+ else
+ msg = linkFile(old_file, new_file);
+
+ if (msg)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
On Fri, Mar 11, 2016 at 1:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
This 001 patch looks so little like what I was expecting that I
decided to start over from scratch. The new version I wrote is
attached here. I don't understand why your version tinkers with the
logic for setting the all-frozen bit; I thought that what I already
committed dealt with that already, and in any case, your version
doesn't even compile against latest sources. Your version also leaves
the scan_all terminology intact even though it's not accurate any
more, and I am not very convinced that the updates to the
page-skipping logic are actually correct. Please have a look over
this version and see what you think.
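As a reference point for the page-skipping question, the per-block rule being
discussed reduces to roughly this standalone sketch; the flag values and the
helper name are illustrative, not taken from the patch.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values for the two visibility map bits per heap page. */
#define VM_ALL_VISIBLE  0x01
#define VM_ALL_FROZEN   0x02

/*
 * A regular vacuum may skip any all-visible block; an aggressive vacuum may
 * only skip blocks that are also all-frozen, because it must visit every
 * page that could still hold an unfrozen XID or MXID.
 */
static bool
block_is_skippable(uint8_t vmflags, bool aggressive)
{
    if (aggressive)
        return (vmflags & VM_ALL_FROZEN) != 0;
    return (vmflags & VM_ALL_VISIBLE) != 0;
}

On top of this per-block test, the patch only skips when a run of at least
SKIP_PAGES_THRESHOLD consecutive skippable blocks is found, so that OS
readahead stays effective.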
Thank you for your advice.
Sorry, the optimization logic in the previous patch was an old version by
mistake.
The attached latest patch incorporates your suggestions with a little
revising.
I think that's kind of pointless. We need to test that this
conversion code works, but once it does, I don't think we should make
everybody pay the overhead of retesting that. Anyway, the test code
could have bugs, too.
Here's an updated version of your patch with that code removed and
some cosmetic cleanups like fixing typos and stuff like that. I think
this is mostly ready to commit, but I noticed one problem: your
conversion code always produces two output pages for each input page
even if one of them would be empty. In particular, if you have a
large number of small relations and run pg_upgrade, all of their
visibility maps will go from 8kB to 16kB. That isn't the end of the
world, maybe, but I think you should see if you can't fix it
somehow....
Thank you for updating the patch.
To deal with this problem, I've changed it so that pg_upgrade checks the
file size before conversion.
If the fork file does not exist or its size is 0 (empty), it is skipped.
Attached is the latest patch.
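Roughly, such a guard could look like the following standalone sketch (the
function name is hypothetical, not the patch's): stat() the old visibility
map fork and skip the rewrite when the fork is missing or empty.

#include <stdbool.h>
#include <sys/stat.h>

/*
 * Hypothetical guard: attempt the visibility map rewrite only when the old
 * fork exists and is non-empty; otherwise there is nothing to convert and
 * the file can simply be skipped.
 */
static bool
vm_fork_needs_rewrite(const char *old_vm_path)
{
    struct stat st;

    if (stat(old_vm_path, &st) != 0)
        return false;           /* fork does not exist */

    return st.st_size > 0;      /* zero-length fork: nothing to convert */
}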
Regards,
--
Masahiko Sawada
Attachments:
001_optimize_vacuum_by_frozen_bit_v40.patch (application/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a09ceb2..2f72633 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5984,12 +5984,15 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs an aggressive scan if the table's
<structname>pg_class</>.<structfield>relfrozenxid</> field has reached
- the age specified by this setting. The default is 150 million
- transactions. Although users can set this value anywhere from zero to
- two billions, <command>VACUUM</> will silently limit the effective value
- to 95% of <xref linkend="guc-autovacuum-freeze-max-age">, so that a
+ the age specified by this setting. An aggressive scan differs from
+ a regular <command>VACUUM</> in that it visits every page that might
+ contain unfrozen XIDs or MXIDs, not just those that might contain dead
+ tuples. The default is 150 million transactions. Although users can
+ set this value anywhere from zero to two billions, <command>VACUUM</>
+ will silently limit the effective value to 95% of
+ <xref linkend="guc-autovacuum-freeze-max-age">, so that a
periodical manual <command>VACUUM</> has a chance to run before an
anti-wraparound autovacuum is launched for the table. For more
information see
@@ -6028,9 +6031,12 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- <command>VACUUM</> performs a whole-table scan if the table's
+ <command>VACUUM</> performs an aggressive scan if the table's
<structname>pg_class</>.<structfield>relminmxid</> field has reached
- the age specified by this setting. The default is 150 million multixacts.
+ the age specified by this setting. An aggressive scan differs from
+ a regular <command>VACUUM</> in that it visits every page that might
+ contain unfrozen XIDs or MXIDs, not just those that might contain dead
+ tuples. The default is 150 million multixacts.
Although users can set this value anywhere from zero to two billions,
<command>VACUUM</> will silently limit the effective value to 95% of
<xref linkend="guc-autovacuum-multixact-freeze-max-age">, so that a
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 5204b34..1f2d70c 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -438,27 +438,32 @@
</para>
<para>
- <command>VACUUM</> normally skips pages that don't have any dead row
- versions, but those pages might still have row versions with old XID
- values. To ensure all old row versions have been frozen, a
- scan of the whole table is needed.
- <xref linkend="guc-vacuum-freeze-table-age"> controls when
- <command>VACUUM</> does that: a whole table sweep is forced if
- the table hasn't been fully scanned for <varname>vacuum_freeze_table_age</>
- minus <varname>vacuum_freeze_min_age</> transactions. Setting it to 0
- forces <command>VACUUM</> to always scan all pages, effectively ignoring
- the visibility map.
+ <command>VACUUM</> uses the <link linkend="storage-vm">visibility map</>
+ to determine which pages of a relation must be scanned. Normally, it
+ will skip pages that don't have any dead row versions even if those pages
+ might still have row versions with old XID values. Therefore, normal
+ scans won't succeed in freezing every row version in the table.
+ Periodically, <command>VACUUM</> will perform an <firstterm>aggressive
+ vacuum</>, skipping only those pages which contain neither dead rows nor
+ any unfrozen XID or MXID values.
+ <xref linkend="guc-vacuum-freeze-table-age">
+ controls when <command>VACUUM</> does that: all-visible but not all-frozen
+ pages are scanned if the number of transactions that have passed since the
+ last such scan is greater than <varname>vacuum_freeze_table_age</> minus
+ <varname>vacuum_freeze_min_age</>. Setting
+ <varname>vacuum_freeze_table_age</> to 0 forces <command>VACUUM</> to
+ use this more aggressive strategy for all scans.
</para>
<para>
The maximum time that a table can go unvacuumed is two billion
transactions minus the <varname>vacuum_freeze_min_age</> value at
- the time <command>VACUUM</> last scanned the whole table. If it were to go
+ the time of the last aggressive vacuum. If it were to go
unvacuumed for longer than
that, data loss could result. To ensure that this does not happen,
autovacuum is invoked on any table that might contain unfrozen rows with
XIDs older than the age specified by the configuration parameter <xref
- linkend="guc-autovacuum-freeze-max-age">. (This will happen even if
+ linkend="guc-autovacuum-freeze-max-age">. (This will happen even if
autovacuum is disabled.)
</para>
@@ -491,7 +496,7 @@
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
was recently vacuumed to reclaim space, whereas lower values lead to more
- frequent whole-table scans.
+ frequent aggressive vacuuming.
</para>
<para>
@@ -527,7 +532,7 @@
<structname>pg_database</>. In particular,
the <structfield>relfrozenxid</> column of a table's
<structname>pg_class</> row contains the freeze cutoff XID that was used
- by the last whole-table <command>VACUUM</> for that table. All rows
+ by the last aggressive <command>VACUUM</> for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</> column of a database's
@@ -554,18 +559,21 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
<para>
<command>VACUUM</> normally
only scans pages that have been modified since the last vacuum, but
- <structfield>relfrozenxid</> can only be advanced when the whole table is
- scanned. The whole table is scanned when <structfield>relfrozenxid</> is
- more than <varname>vacuum_freeze_table_age</> transactions old, when
- <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all pages
- happen to
+ <structfield>relfrozenxid</> can only be advanced when every page of the
+ table that might contain unfrozen XIDs is scanned. This happens when
+ <structfield>relfrozenxid</> is more than
+ <varname>vacuum_freeze_table_age</> transactions old, when
+ <command>VACUUM</>'s <literal>FREEZE</> option is used, or when all
+ pages that are not already all-frozen happen to
require vacuuming to remove dead row versions. When <command>VACUUM</>
- scans the whole table, after it's finished <literal>age(relfrozenxid)</>
- should be a little more than the <varname>vacuum_freeze_min_age</> setting
- that was used (more by the number of transactions started since the
- <command>VACUUM</> started). If no whole-table-scanning <command>VACUUM</>
- is issued on the table until <varname>autovacuum_freeze_max_age</> is
- reached, an autovacuum will soon be forced for the table.
+ scans every page in the table that is not already all-frozen, it should
+ set <literal>age(relfrozenxid)</> to a value just a little more than the
+ <varname>vacuum_freeze_min_age</> setting
+ that was used (more by the number of transactions started since the
+ <command>VACUUM</> started). If no <structfield>relfrozenxid</>-advancing
+ <command>VACUUM</> is issued on the table until
+ <varname>autovacuum_freeze_max_age</> is reached, an autovacuum will soon
+ be forced for the table.
</para>
<para>
@@ -634,21 +642,23 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
</para>
<para>
- During a <command>VACUUM</> table scan, either partial or of the whole
- table, any multixact ID older than
+ Whenever <command>VACUUM</> scans any part of a table, it will replace
+ any multixact ID it encounters which is older than
<xref linkend="guc-vacuum-multixact-freeze-min-age">
- is replaced by a different value, which can be the zero value, a single
+ by a different value, which can be the zero value, a single
transaction ID, or a newer multixact ID. For each table,
<structname>pg_class</>.<structfield>relminmxid</> stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
- <xref linkend="guc-vacuum-multixact-freeze-table-age">, a whole-table
- scan is forced. <function>mxid_age()</> can be used on
+ <xref linkend="guc-vacuum-multixact-freeze-table-age">, an aggressive
+ vacuum is forced. As discussed in the previous section, an aggressive
+ vacuum means that only those pages which are known to be all-frozen will
+ be skipped. <function>mxid_age()</> can be used on
<structname>pg_class</>.<structfield>relminmxid</> to find its age.
</para>
<para>
- Whole-table <command>VACUUM</> scans, regardless of
+ Aggressive <command>VACUUM</> scans, regardless of
what causes them, enable advancing the value for that table.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
@@ -656,13 +666,13 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
</para>
<para>
- As a safety device, a whole-table vacuum scan will occur for any table
+ As a safety device, an aggressive vacuum scan will occur for any table
whose multixact-age is greater than
- <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Whole-table
+ <xref linkend="guc-autovacuum-multixact-freeze-max-age">. Aggressive
vacuum scans will also occur progressively for all tables, starting with
those that have the oldest multixact-age, if the amount of used member
storage space exceeds the amount 50% of the addressable storage space.
- Both of these kinds of whole-table scans will occur even if autovacuum is
+ Both of these kinds of aggressive scans will occur even if autovacuum is
nominally disabled.
</para>
</sect3>
@@ -743,9 +753,9 @@ vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuple
<command>UPDATE</command> and <command>DELETE</command> operation. (It
is only semi-accurate because some information might be lost under heavy
load.) If the <structfield>relfrozenxid</> value of the table is more
- than <varname>vacuum_freeze_table_age</> transactions old, the whole
- table is scanned to freeze old tuples and advance
- <structfield>relfrozenxid</>, otherwise only pages that have been modified
+ than <varname>vacuum_freeze_table_age</> transactions old, an aggressive
+ vacuum is performed to freeze old tuples and advance
+ <structfield>relfrozenxid</>; otherwise, only pages that have been modified
since the last vacuum are scanned.
</para>
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 61d2edd..777efbb 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -106,6 +106,7 @@ typedef struct LVRelStats
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* number of pages we examined */
BlockNumber pinskipped_pages; /* # of pages we skipped due to a pin */
+ BlockNumber frozenskipped_pages; /* # of frozen pages we skipped */
double scanned_tuples; /* counts only tuples on scanned pages */
double old_rel_tuples; /* previous value of pg_class.reltuples */
double new_rel_tuples; /* new estimated total # of tuples */
@@ -136,7 +137,7 @@ static BufferAccessStrategy vac_strategy;
/* non-export function prototypes */
static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
- Relation *Irel, int nindexes, bool scan_all);
+ Relation *Irel, int nindexes, bool aggressive);
static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
static bool lazy_check_needs_freeze(Buffer buf, bool *hastup);
static void lazy_vacuum_index(Relation indrel,
@@ -182,8 +183,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
int usecs;
double read_rate,
write_rate;
- bool scan_all; /* should we scan all pages? */
- bool scanned_all; /* did we actually scan all pages? */
+ bool aggressive; /* should we scan all unfrozen pages? */
+ bool scanned_all_unfrozen; /* actually scanned all such pages? */
TransactionId xidFullScanLimit;
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
@@ -221,14 +222,15 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
&MultiXactCutoff, &mxactFullScanLimit);
/*
- * We request a full scan if either the table's frozen Xid is now older
- * than or equal to the requested Xid full-table scan limit; or if the
- * table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit.
+ * We request an aggressive scan if either the table's frozen Xid is now
+ * older than or equal to the requested Xid full-table scan limit; or if
+ * the table's minimum MultiXactId is older than or equal to the requested
+ * mxid full-table scan limit. Even during an aggressive scan, we can
+ * still skip pages that the visibility map shows as all-frozen.
*/
- scan_all = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
+ aggressive = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
xidFullScanLimit);
- scan_all |= MultiXactIdPrecedesOrEquals(onerel->rd_rel->relminmxid,
+ aggressive |= MultiXactIdPrecedesOrEquals(onerel->rd_rel->relminmxid,
mxactFullScanLimit);
vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
@@ -244,7 +246,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
vacrelstats->hasindex = (nindexes > 0);
/* Do the vacuuming */
- lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, scan_all);
+ lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, aggressive);
/* Done with indexes */
vac_close_indexes(nindexes, Irel, NoLock);
@@ -256,13 +258,19 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* NB: We need to check this before truncating the relation, because that
* will change ->rel_pages.
*/
- if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
+ if ((vacrelstats->scanned_pages + vacrelstats->frozenskipped_pages)
+ < vacrelstats->rel_pages)
{
- Assert(!scan_all);
- scanned_all = false;
+ if (aggressive)
+ elog(FATAL, "scanned %u frozenskipped %u total %u",
+ vacrelstats->scanned_pages, vacrelstats->frozenskipped_pages,
+ vacrelstats->rel_pages);
+
+ Assert(!aggressive);
+ scanned_all_unfrozen = false;
}
else
- scanned_all = true;
+ scanned_all_unfrozen = true;
/*
* Optionally truncate the relation.
@@ -277,9 +285,9 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
* Update statistics in pg_class.
*
* A corner case here is that if we scanned no pages at all because every
- * page is all-visible, we should not update relpages/reltuples, because
- * we have no new information to contribute. In particular this keeps us
- * from replacing relpages=reltuples=0 (which means "unknown tuple
+ * page is all-visible or all-frozen, we should not update relpages/reltuples,
+ * because we have no new information to contribute. In particular this keeps
+ * us from replacing relpages=reltuples=0 (which means "unknown tuple
* density") with nonzero relpages and reltuples=0 (which means "zero
* tuple density") unless there's some actual evidence for the latter.
*
@@ -302,8 +310,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
- new_frozen_xid = scanned_all ? FreezeLimit : InvalidTransactionId;
- new_min_multi = scanned_all ? MultiXactCutoff : InvalidMultiXactId;
+ new_frozen_xid = scanned_all_unfrozen ? FreezeLimit : InvalidTransactionId;
+ new_min_multi = scanned_all_unfrozen ? MultiXactCutoff : InvalidMultiXactId;
vac_update_relstats(onerel,
new_rel_pages,
@@ -358,10 +366,11 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
vacrelstats->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen\n"),
vacrelstats->pages_removed,
vacrelstats->rel_pages,
- vacrelstats->pinskipped_pages);
+ vacrelstats->pinskipped_pages,
+ vacrelstats->frozenskipped_pages);
appendStringInfo(&buf,
_("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable\n"),
vacrelstats->tuples_deleted,
@@ -434,7 +443,7 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
*/
static void
lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
- Relation *Irel, int nindexes, bool scan_all)
+ Relation *Irel, int nindexes, bool aggressive)
{
BlockNumber nblocks,
blkno;
@@ -450,8 +459,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
int i;
PGRUsage ru0;
Buffer vmbuffer = InvalidBuffer;
- BlockNumber next_not_all_visible_block;
- bool skipping_all_visible_blocks;
+ BlockNumber next_unskippable_block;
+ bool skipping_blocks;
xl_heap_freeze_tuple *frozen;
StringInfoData buf;
@@ -479,35 +488,39 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
/*
- * We want to skip pages that don't require vacuuming according to the
- * visibility map, but only when we can skip at least SKIP_PAGES_THRESHOLD
- * consecutive pages. Since we're reading sequentially, the OS should be
- * doing readahead for us, so there's no gain in skipping a page now and
- * then; that's likely to disable readahead and so be counterproductive.
- * Also, skipping even a single page means that we can't update
- * relfrozenxid, so we only want to do it if we can skip a goodly number
- * of pages.
+ * Except when aggressive is set, we want to skip pages that are
+ * all-visible according to the visibility map, but only when we can skip
+ * at least SKIP_PAGES_THRESHOLD consecutive pages. Since we're reading
+ * sequentially, the OS should be doing readahead for us, so there's no
+ * gain in skipping a page now and then; that's likely to disable
+ * readahead and so be counterproductive. Also, skipping even a single
+ * page means that we can't update relfrozenxid, so we only want to do it
+ * if we can skip a goodly number of pages.
*
- * Before entering the main loop, establish the invariant that
- * next_not_all_visible_block is the next block number >= blkno that's not
- * all-visible according to the visibility map, or nblocks if there's no
- * such block. Also, we set up the skipping_all_visible_blocks flag,
- * which is needed because we need hysteresis in the decision: once we've
- * started skipping blocks, we may as well skip everything up to the next
- * not-all-visible block.
+ * When aggressive is set, we can't skip pages just because they are
+ * all-visible, but we can still skip pages that are all-frozen, since
+ * such pages do not need freezing and do not affect the value that we can
+ * safely set for relfrozenxid or relminmxid.
*
- * Note: if scan_all is true, we won't actually skip any pages; but we
- * maintain next_not_all_visible_block anyway, so as to set up the
- * all_visible_according_to_vm flag correctly for each page.
+ * Before entering the main loop, establish the invariant that
+ * next_unskippable_block is the next block number >= blkno that we
+ * can't skip based on the visibility map, either all-visible for a
+ * regular scan or all-frozen for an aggressive scan. We set it to
+ * nblocks if there's no such block. We also set up the skipping_blocks
+ * flag correctly at this stage.
*
* Note: The value returned by visibilitymap_get_status could be slightly
* out-of-date, since we make this test before reading the corresponding
* heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible when in fact the flag's just been cleared,
- * we might fail to vacuum the page. But it's OK to skip pages when
- * scan_all is not set, so no great harm done; the next vacuum will find
- * them. If we make the reverse mistake and vacuum a page unnecessarily,
- * it'll just be a no-op.
+ * that the page is all-visible or all-frozen when in fact the flag's just
+ * been cleared, we might fail to vacuum the page. It's easy to see that
+ * skipping a page when aggressive is not set is not a very big deal; we
+ * might leave some dead tuples lying around, but the next vacuum will
+ * find them. But even when aggressive *is* set, it's still OK if we miss
+ * a page whose all-frozen marking has just been cleared. Any new XIDs
+ * just added to that page are necessarily newer than the GlobalXmin we
+ * computed, so they'll have no effect on the value to which we can safely
+ * set relfrozenxid. A similar argument applies for MXIDs and relminmxid.
*
* We will scan the table's last page, at least to the extent of
* determining whether it has tuples or not, even if it should be skipped
@@ -518,18 +531,31 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* the last page. This is worth avoiding mainly because such a lock must
* be replayed on any hot standby, where it can be disruptive.
*/
- for (next_not_all_visible_block = 0;
- next_not_all_visible_block < nblocks;
- next_not_all_visible_block++)
+ for (next_unskippable_block = 0;
+ next_unskippable_block < nblocks;
+ next_unskippable_block++)
{
- if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
- break;
+ uint8 vmstatus;
+
+ vmstatus = visibilitymap_get_status(onerel, next_unskippable_block,
+ &vmbuffer);
+ if (aggressive)
+ {
+ if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
+ break;
+ }
+ else
+ {
+ if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
+ break;
+ }
vacuum_delay_point();
}
- if (next_not_all_visible_block >= SKIP_PAGES_THRESHOLD)
- skipping_all_visible_blocks = true;
+
+ if (next_unskippable_block >= SKIP_PAGES_THRESHOLD)
+ skipping_blocks = true;
else
- skipping_all_visible_blocks = false;
+ skipping_blocks = false;
for (blkno = 0; blkno < nblocks; blkno++)
{
@@ -542,7 +568,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
int prev_dead_count;
int nfrozen;
Size freespace;
- bool all_visible_according_to_vm;
+ bool all_visible_according_to_vm = false;
bool all_visible;
bool all_frozen = true; /* provided all_visible is also true */
bool has_dead_tuples;
@@ -552,15 +578,28 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
#define FORCE_CHECK_PAGE() \
(blkno == nblocks - 1 && should_attempt_truncation(vacrelstats))
- if (blkno == next_not_all_visible_block)
+ if (blkno == next_unskippable_block)
{
- /* Time to advance next_not_all_visible_block */
- for (next_not_all_visible_block++;
- next_not_all_visible_block < nblocks;
- next_not_all_visible_block++)
+ /* Time to advance next_unskippable_block */
+ for (next_unskippable_block++;
+ next_unskippable_block < nblocks;
+ next_unskippable_block++)
{
- if (!VM_ALL_VISIBLE(onerel, next_not_all_visible_block, &vmbuffer))
- break;
+ uint8 vmskipflags;
+
+ vmskipflags = visibilitymap_get_status(onerel,
+ next_unskippable_block,
+ &vmbuffer);
+ if (aggressive)
+ {
+ if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0)
+ break;
+ }
+ else
+ {
+ if ((vmskipflags & VISIBILITYMAP_ALL_VISIBLE) == 0)
+ break;
+ }
vacuum_delay_point();
}
@@ -569,17 +608,44 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* skipping_all_visible_blocks to do the right thing at the
* following blocks.
*/
- if (next_not_all_visible_block - blkno > SKIP_PAGES_THRESHOLD)
- skipping_all_visible_blocks = true;
+ if (next_unskippable_block - blkno > SKIP_PAGES_THRESHOLD)
+ skipping_blocks = true;
else
- skipping_all_visible_blocks = false;
- all_visible_according_to_vm = false;
+ skipping_blocks = false;
+
+ /*
+ * Normally, the fact that we can't skip this block must mean that
+ * it's not all-visible. But in an aggressive vacuum we know only
+ * that it's not all-frozen, so it might still be all-visible.
+ */
+ if (aggressive && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
+ all_visible_according_to_vm = true;
}
else
{
- /* Current block is all-visible */
- if (skipping_all_visible_blocks && !scan_all && !FORCE_CHECK_PAGE())
+ /*
+ * The current block is potentially skippable; if we've seen a
+ * long enough run of skippable blocks to justify skipping it,
+ * and we're not forced to check it, then go ahead and skip.
+ * Otherwise, the page must be at least all-visible if not
+ * all-frozen, so we can set all_visible_according_to_vm = true.
+ */
+ if (skipping_blocks && !FORCE_CHECK_PAGE())
+ {
+ /*
+ * Tricky, tricky. If this is an aggressive vacuum, the page
+ * must have been all-frozen at the time we checked whether it
+ * was skippable, but it might not be any more. We must be
+ * careful to count it as a skipped all-frozen page in that
+ * case, or else we'll think we can't update relfrozenxid and
+ * relminmxid. If it's not an aggressive vacuum, we don't know
+ * whether it was all-frozen, so we have to recheck; but in
+ * this case an approximate answer is OK.
+ */
+ if (aggressive || VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
+ vacrelstats->frozenskipped_pages++;
continue;
+ }
all_visible_according_to_vm = true;
}
@@ -628,9 +694,10 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* Pin the visibility map page in case we need to mark the page
* all-visible. In most cases this will be very cheap, because we'll
* already have the correct page pinned anyway. However, it's
- * possible that (a) next_not_all_visible_block is covered by a
- * different VM page than the current block or (b) we released our pin
- * and did a cycle of index vacuuming.
+ * possible that (a) next_unskippable_block is covered by a different
+ * VM page than the current block or (b) we released our pin and did a
+ * cycle of index vacuuming.
+
*/
visibilitymap_pin(onerel, blkno, &vmbuffer);
@@ -641,12 +708,12 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
if (!ConditionalLockBufferForCleanup(buf))
{
/*
- * If we're not scanning the whole relation to guard against XID
+ * If we're not performing an aggressive scan to guard against XID
* wraparound, and we don't want to forcibly check the page, then
* it's OK to skip vacuuming pages we get a lock conflict on. They
* will be dealt with in some future vacuum.
*/
- if (!scan_all && !FORCE_CHECK_PAGE())
+ if (!aggressive && !FORCE_CHECK_PAGE())
{
ReleaseBuffer(buf);
vacrelstats->pinskipped_pages++;
@@ -663,7 +730,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* ourselves for multiple buffers and then service whichever one
* is received first. For now, this seems good enough.
*
- * If we get here with scan_all false, then we're just forcibly
+ * If we get here with aggressive false, then we're just forcibly
* checking the page, and so we don't want to insist on getting
* the lock; we only need to know if the page contains tuples, so
* that we can update nonempty_pages correctly. It's convenient
@@ -679,7 +746,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
vacrelstats->nonempty_pages = blkno + 1;
continue;
}
- if (!scan_all)
+ if (!aggressive)
{
/*
* Here, we must not advance scanned_pages; that would amount
@@ -1025,7 +1092,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
/* mark page all-visible, if appropriate */
if (all_visible && !all_visible_according_to_vm)
{
- uint8 flags = VISIBILITYMAP_ALL_VISIBLE;
+ uint8 flags = VISIBILITYMAP_ALL_VISIBLE;
if (all_frozen)
flags |= VISIBILITYMAP_ALL_FROZEN;
@@ -1171,6 +1238,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RelationGetRelationName(onerel),
tups_vacuumed, vacuumed_pages)));
+ /* Report how many frozen pages vacuum skipped according to visibility map */
+ ereport(elevel,
+ (errmsg_plural("skipped %u frozen page according to visibility map",
+ "skipped %u frozen pages according to visibility map",
+ vacrelstats->frozenskipped_pages,
+ vacrelstats->frozenskipped_pages)));
+
/*
* This is pretty messy, but we split it up so that we can skip emitting
* individual parts of the message when not applicable.
diff --git a/src/test/regress/expected/visibilitymap.out b/src/test/regress/expected/visibilitymap.out
new file mode 100644
index 0000000..6a0b7b8
--- /dev/null
+++ b/src/test/regress/expected/visibilitymap.out
@@ -0,0 +1,24 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+VACUUM vmtest;
+--
+-- Check whether vacuum skips pages according to visibility map
+--
+\set VERBOSITY terse
+-- First VACUUM FREEZE cannot skip any page.
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 0 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 10000 nonremovable row versions in 45 out of 45 pages
+-- Second VACUUM FREEZE should skip all pages.
+VACUUM FREEZE VERBOSE vmtest;
+INFO: vacuuming "public.vmtest"
+INFO: index "vmtest_pkey" now contains 10000 row versions in 30 pages
+INFO: skipped 44 frozen pages according to visibility map
+INFO: "vmtest": found 0 removable, 56 nonremovable row versions in 1 out of 45 pages
+\set VERBOSITY default
+DROP TABLE vmtest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index bec0316..9ad2ffc 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -110,3 +110,6 @@ test: event_trigger
# run stats by itself because its delay may be insufficient under heavy load
test: stats
+
+# visibility map and vacuum test cannot run concurrently with any test that runs SQL
+test: visibilitymap
\ No newline at end of file
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 7e9b319..4b4eb07 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -162,3 +162,4 @@ test: with
test: xml
test: event_trigger
test: stats
+test: visibilitymap
\ No newline at end of file
diff --git a/src/test/regress/sql/visibilitymap.sql b/src/test/regress/sql/visibilitymap.sql
new file mode 100644
index 0000000..7ce5ae1
--- /dev/null
+++ b/src/test/regress/sql/visibilitymap.sql
@@ -0,0 +1,18 @@
+--
+-- Visibility Map
+--
+CREATE TABLE vmtest (i INT primary key);
+INSERT INTO vmtest SELECT generate_series(1,10000);
+VACUUM vmtest;
+
+--
+-- Check whether vacuum skips pages according to visibility map
+--
+\set VERBOSITY terse
+-- First VACUUM FREEZE cannot skip any page.
+VACUUM FREEZE VERBOSE vmtest;
+-- Second VACUUM FREEZE should skip all pages.
+VACUUM FREEZE VERBOSE vmtest;
+\set VERBOSITY default
+
+DROP TABLE vmtest;
000_pgupgrade_rewrite_vm_v42.patch (application/octet-stream)
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 2a99a28..34e1451 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -9,10 +9,15 @@
#include "postgres_fe.h"
+#include "access/visibilitymap.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
#include <fcntl.h>
+#define BITS_PER_HEAPBLOCK_OLD 1
#ifndef WIN32
@@ -138,6 +143,130 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilityMap()
+ *
+ * In versions of PostgreSQL prior to catversion 201603011, PostgreSQL's
+ * visibility map included one bit per heap page; it now includes two.
+ * When upgrading a cluster from before that time to a current PostgreSQL
+ * version, we could refuse to copy visibility maps from the old cluster
+ * to the new cluster; the next VACUUM would recreate them, but at the
+ * price of scanning the entire table. So, instead, we rewrite the old
+ * visibility maps in the new format. That way, the all-visible bit
+ * remains set for the pages for which it was set previously. The
+ * all-frozen bit is never set by this conversion; we leave that to
+ * VACUUM.
+ */
+const char *
+rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
+{
+ int src_fd = 0;
+ int dst_fd = 0;
+ char buffer[BLCKSZ];
+ ssize_t bytesRead;
+ int rewriteVmBytesPerPage;
+ BlockNumber blkno = 0;
+
+ /* Compute how many old-format page bytes we need to rewrite one new page */
+ rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return "Invalid old file or new file";
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ return getErrorText();
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ {
+ close(src_fd);
+ return getErrorText();
+ }
+
+ /*
+ * Turn each visibility map page into 2 pages one by one. Each new page
+ * has the same page header as the old one.
+ */
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
+ char *old_cur,
+ *old_break,
+ *old_blkend;
+ PageHeaderData pageheader;
+
+ /* Save the page header data */
+ memcpy(&pageheader, buffer, SizeOfPageHeaderData);
+
+ /*
+ * These old_* variables point to old visibility map page. old_cur
+ * points to current position on old page. old_blkend points to end of
+ * old block. old_break points to old page break position for rewriting
+ * a new page. After writing a new page, old_break advances by
+ * rewriteVmBytesPerPage bytes.
+ */
+ old_cur = buffer + SizeOfPageHeaderData;
+ old_blkend = buffer + bytesRead;
+ old_break = old_cur + rewriteVmBytesPerPage;
+
+ while (old_blkend >= old_break)
+ {
+ char vmbuf[BLCKSZ];
+ char *new_cur = vmbuf;
+
+ /* Copy page header in advance */
+ memcpy(vmbuf, &pageheader, SizeOfPageHeaderData);
+
+ new_cur += SizeOfPageHeaderData;
+
+ /*
+ * Process old page bytes one by one, and turn them into the new page.
+ */
+ while (old_break > old_cur)
+ {
+ uint16 new_vmbits = 0;
+ int i;
+
+ /* Generate new format bits while keeping old information */
+ for (i = 0; i < BITS_PER_BYTE; i++)
+ {
+ uint8 byte = * (uint8 *) old_cur;
+
+ if (((byte & (1 << (BITS_PER_HEAPBLOCK_OLD * i)))) != 0)
+ new_vmbits |= 1 << (BITS_PER_HEAPBLOCK * i);
+ }
+
+ /* Copy new visibility map bit to new format page */
+ memcpy(new_cur, &new_vmbits, BITS_PER_HEAPBLOCK);
+
+ old_cur += BITS_PER_HEAPBLOCK_OLD;
+ new_cur += BITS_PER_HEAPBLOCK;
+ }
+
+ /* Set new checksum for a visibility map page (if enabled) */
+ if (old_cluster.controldata.data_checksum_version != 0 &&
+ new_cluster.controldata.data_checksum_version != 0)
+ ((PageHeader) vmbuf)->pd_checksum =
+ pg_checksum_page(vmbuf, blkno);
+
+ if (write(dst_fd, vmbuf, BLCKSZ) != BLCKSZ)
+ {
+ close(dst_fd);
+ close(src_fd);
+ return getErrorText();
+ }
+
+ old_break += rewriteVmBytesPerPage;
+ blkno++;
+ }
+ }
+
+ /* Close files */
+ close(dst_fd);
+ close(src_fd);
+
+ return NULL;
+
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 6122878..89beb73 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -110,6 +110,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map changed with this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201603011
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -365,6 +369,8 @@ bool pid_lock_file_exists(const char *datadir);
const char *copyFile(const char *src, const char *dst, bool force);
const char *linkFile(const char *src, const char *dst);
+const char *rewriteVisibilityMap(const char *fromfile, const char *tofile,
+ bool force);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index b20f073..6dea2b4 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -11,12 +11,13 @@
#include "pg_upgrade.h"
+#include <sys/stat.h>
#include "catalog/pg_class.h"
#include "access/transam.h"
static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
-static void transfer_relfile(FileNameMap *map, const char *suffix);
+static void transfer_relfile(FileNameMap *map, const char *suffix, bool vm_must_add_frozenbit);
/*
@@ -132,6 +133,7 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_must_add_frozenbit = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -141,13 +143,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_must_add_frozenbit = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(&maps[mapnum], "");
+ transfer_relfile(&maps[mapnum], "", vm_must_add_frozenbit);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -155,9 +164,9 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(&maps[mapnum], "_fsm");
+ transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit);
if (vm_crashsafe_match)
- transfer_relfile(&maps[mapnum], "_vm");
+ transfer_relfile(&maps[mapnum], "_vm", vm_must_add_frozenbit);
}
}
}
@@ -167,17 +176,19 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
/*
* transfer_relfile()
*
- * Copy or link file from old cluster to new one.
+ * Copy or link file from old cluster to new one. If vm_must_add_frozenbit
+ * is true, visibility map forks are converted and rewritten, even in link
+ * mode.
*/
static void
-transfer_relfile(FileNameMap *map, const char *type_suffix)
+transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit)
{
const char *msg;
char old_file[MAXPGPATH];
char new_file[MAXPGPATH];
- int fd;
int segno;
char extent_suffix[65];
+ struct stat statbuf;
/*
* Now copy/link any related segments as well. Remember, PG breaks large
@@ -210,7 +221,7 @@ transfer_relfile(FileNameMap *map, const char *type_suffix)
if (type_suffix[0] != '\0' || segno != 0)
{
/* Did file open fail? */
- if ((fd = open(old_file, O_RDONLY, 0)) == -1)
+ if ((stat(old_file, &statbuf) != 0))
{
/* File does not exist? That's OK, just return */
if (errno == ENOENT)
@@ -220,7 +231,10 @@ transfer_relfile(FileNameMap *map, const char *type_suffix)
map->nspname, map->relname, old_file, new_file,
getErrorText());
}
- close(fd);
+
+ /* If file is empty, just return */
+ if (statbuf.st_size == 0)
+ return;
}
unlink(new_file);
@@ -232,7 +246,13 @@ transfer_relfile(FileNameMap *map, const char *type_suffix)
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyFile(old_file, new_file, true)) != NULL)
+ /* Rewrite visibility map if needed */
+ if (vm_must_add_frozenbit && (strcmp(type_suffix, "_vm") == 0))
+ msg = rewriteVisibilityMap(old_file, new_file, true);
+ else
+ msg = copyFile(old_file, new_file, true);
+
+ if (msg)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -240,7 +260,13 @@ transfer_relfile(FileNameMap *map, const char *type_suffix)
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkFile(old_file, new_file)) != NULL)
+ /* Rewrite visibility map if needed */
+ if (vm_must_add_frozenbit && (strcmp(type_suffix, "_vm") == 0))
+ msg = rewriteVisibilityMap(old_file, new_file, true);
+ else
+ msg = linkFile(old_file, new_file);
+
+ if (msg)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
On Thu, Mar 10, 2016 at 1:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Mar 11, 2016 at 1:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
This 001 patch looks so little like what I was expecting that I
decided to start over from scratch. The new version I wrote is
attached here. I don't understand why your version tinkers with the
logic for setting the all-frozen bit; I thought that what I already
committed dealt with that already, and in any case, your version
doesn't even compile against latest sources. Your version also leaves
the scan_all terminology intact even though it's not accurate any
more, and I am not very convinced that the updates to the
page-skipping logic are actually correct. Please have a look over
this version and see what you think.

Thank you for your advice.
Sorry, the optimising logic in the previous patch was outdated by mistake.
The attached latest patch incorporates your suggestions with a little revising.
OK, I'll have a look. Thanks.
I think that's kind of pointless. We need to test that this
conversion code works, but once it does, I don't think we should make
everybody pay the overhead of retesting that. Anyway, the test code
could have bugs, too.

Here's an updated version of your patch with that code removed and
some cosmetic cleanups like fixing typos and stuff like that. I think
this is mostly ready to commit, but I noticed one problem: your
conversion code always produces two output pages for each input page
even if one of them would be empty. In particular, if you have a
large number of small relations and run pg_upgrade, all of their
visibility maps will go from 8kB to 16kB. That isn't the end of the
world, maybe, but I think you should see if you can't fix it
somehow....

Thank you for updating the patch.
To deal with this problem, I've changed it so that pg_upgrade checks
the file size before conversion.
If the fork file does not exist or its size is 0 (empty), it is ignored.
The latest patch is attached.
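For illustration only, here is a minimal sketch of that pre-check (not code
from the attached patch; the helper name and return convention are
hypothetical, and the real patch reports unexpected stat() failures with
pg_fatal() instead of silently skipping):

#include <stdbool.h>
#include <sys/stat.h>

/* Return true if the old fork should be transferred, false if it can be skipped. */
static bool
old_fork_needs_transfer(const char *old_file)
{
    struct stat statbuf;

    /* A missing fork (e.g. ENOENT) means there is nothing to copy or link. */
    if (stat(old_file, &statbuf) != 0)
        return false;

    /* A zero-length fork is skipped as well, so no new file is created. */
    return statbuf.st_size > 0;
}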
I think what I really want is some logic so that if we have a 1-page
visibility map in the old cluster and the second half of that page is
all zeroes, we only create a 1-page visibility map in the new cluster
rather than a 2-page visibility map.
Or more generally, if the old VM is N pages, but the last half of the
last page is empty, then let the output VM be 2*N-1 pages instead of
2*N pages.
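To make the arithmetic concrete (an illustrative sketch, not code from the
patch; the macro and function names here are made up): the old format stores
1 bit per heap block and the new format 2 bits, so each old VM page normally
expands into 2 new pages, and an empty second half of the last old page saves
exactly one output page.

#include <stdbool.h>

#define OLD_BITS_PER_HEAPBLOCK 1
#define NEW_BITS_PER_HEAPBLOCK 2

/* Expected number of new-format VM pages for an old VM of n_old_pages pages. */
static unsigned
expected_new_vm_pages(unsigned n_old_pages, bool last_half_of_last_page_empty)
{
    unsigned expansion = NEW_BITS_PER_HEAPBLOCK / OLD_BITS_PER_HEAPBLOCK;  /* = 2 */
    unsigned n_new_pages = expansion * n_old_pages;

    /* Drop the final output page when the last half of the last old page is empty. */
    if (n_old_pages > 0 && last_half_of_last_page_empty)
        n_new_pages--;          /* 2*N - 1 instead of 2*N */

    return n_new_pages;
}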
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Mar 10, 2016 at 1:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Mar 11, 2016 at 1:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
This 001 patch looks so little like what I was expecting that I
decided to start over from scratch. The new version I wrote is
attached here. I don't understand why your version tinkers with the
logic for setting the all-frozen bit; I thought that what I already
committed dealt with that already, and in any case, your version
doesn't even compile against latest sources. Your version also leaves
the scan_all terminology intact even though it's not accurate any
more, and I am not very convinced that the updates to the
page-skipping logic are actually correct. Please have a look over
this version and see what you think.

Thank you for your advice.
Sorry, the optimising logic in the previous patch was outdated by mistake.
The attached latest patch incorporates your suggestions with a little revising.

Thanks. I adopted some of your suggestions, rejected others, fixed a
few minor things that I missed previously, and committed this. If you
think any of the changes that I rejected still have merit, please
resubmit those changes as separate patches.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 11, 2016 at 6:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Mar 10, 2016 at 1:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Mar 11, 2016 at 1:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
This 001 patch looks so little like what I was expecting that I
decided to start over from scratch. The new version I wrote is
attached here. I don't understand why your version tinkers with the
logic for setting the all-frozen bit; I thought that what I already
committed dealt with that already, and in any case, your version
doesn't even compile against latest sources. Your version also leaves
the scan_all terminology intact even though it's not accurate any
more, and I am not very convinced that the updates to the
page-skipping logic are actually correct. Please have a look over
this version and see what you think.

Thank you for your advice.
Sorry, the optimising logic in the previous patch was outdated by mistake.
The attached latest patch incorporates your suggestions with a little revising.

Thanks. I adopted some of your suggestions, rejected others, fixed a
few minor things that I missed previously, and committed this. If you
think any of the changes that I rejected still have merit, please
resubmit those changes as separate patches.
Thank you for your effort on this feature and for committing it.
I feel I couldn't do good work on this feature at the final stage,
but I really appreciate all your advice and suggestions.
I think what I really want is some logic so that if we have a 1-page
visibility map in the old cluster and the second half of that page is
all zeroes, we only create a 1-page visibility map in the new cluster
rather than a 2-page visibility map.Or more generally, if the old VM is N pages, but the last half of the
last page is empty, then let the output VM be 2*N-1 pages instead of
2*N pages.
I got your point.
The attached latest patch can skip writing the last part of the last old page
if it's empty.
Please review it.
Regards,
--
Masahiko Sawada
Attachments:
000_pgupgrade_rewrite_vm_v42.patch (application/x-patch)
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 2a99a28..7783b8a 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -9,10 +9,16 @@
#include "postgres_fe.h"
+#include "access/visibilitymap.h"
#include "pg_upgrade.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
+#include "storage/checksum_impl.h"
+#include <sys/stat.h>
#include <fcntl.h>
+#define BITS_PER_HEAPBLOCK_OLD 1
#ifndef WIN32
@@ -138,6 +144,156 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
#endif
+/*
+ * rewriteVisibilityMap()
+ *
+ * In versions of PostgreSQL prior to catversion 201603011, PostgreSQL's
+ * visibility map included one bit per heap page; it now includes two.
+ * When upgrading a cluster from before that time to a current PostgreSQL
+ * version, we could refuse to copy visibility maps from the old cluster
+ * to the new cluster; the next VACUUM would recreate them, but at the
+ * price of scanning the entire table. So, instead, we rewrite the old
+ * visibility maps in the new format. That way, the all-visible bit
+ * remains set for the pages for which it was set previously. The
+ * all-frozen bit is never set by this conversion; we leave that to
+ * VACUUM.
+ */
+const char *
+rewriteVisibilityMap(const char *fromfile, const char *tofile, bool force)
+{
+ int src_fd = 0;
+ int dst_fd = 0;
+ char buffer[BLCKSZ];
+ ssize_t bytesRead;
+ ssize_t src_filesize;
+ int rewriteVmBytesPerPage;
+ BlockNumber new_blkno = 0;
+ struct stat statbuf;
+
+ /* Compute how many old-format page bytes we need to rewrite one new page */
+ rewriteVmBytesPerPage = (BLCKSZ - SizeOfPageHeaderData) / 2;
+
+ if ((fromfile == NULL) || (tofile == NULL))
+ return "Invalid old file or new file";
+
+ if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
+ return getErrorText();
+
+ if (fstat(src_fd, &statbuf) != 0)
+ {
+ close(src_fd);
+ return getErrorText();
+ }
+
+ if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
+ {
+ close(src_fd);
+ return getErrorText();
+ }
+
+ /* Save old file size */
+ src_filesize = statbuf.st_size;
+
+ /*
+ * Turn each visibility map page into 2 pages one by one. Each new page
+ * has the same page header as the old one. If the last section of the last
+ * page is empty, we skip writing it. More generally, if the old visibility
+ * map has N pages but the last part of the last page is empty, this routine
+ * outputs (BITS_PER_HEAPBLOCK / BITS_PER_HEAPBLOCK_OLD) * N - 1 pages instead
+ * of (BITS_PER_HEAPBLOCK / BITS_PER_HEAPBLOCK_OLD) * N pages.
+ */
+ while ((bytesRead = read(src_fd, buffer, BLCKSZ)) == BLCKSZ)
+ {
+ char *old_cur;
+ char *old_break;
+ char *old_blkend;
+ PageHeaderData pageheader;
+ bool old_lastblk = ((BLCKSZ * (new_blkno + 1)) == src_filesize);
+
+ /* Save the page header data */
+ memcpy(&pageheader, buffer, SizeOfPageHeaderData);
+
+ /*
+ * These old_* variables point to old visibility map page. old_cur
+ * points to current position on old page. old_blkend points to end of
+ * old block. old_break points to old page break position for rewriting
+ * a new page. After writing a new page, old_break advances by
+ * rewriteVmBytesPerPage bytes.
+ */
+ old_cur = buffer + SizeOfPageHeaderData;
+ old_blkend = buffer + bytesRead;
+ old_break = old_cur + rewriteVmBytesPerPage;
+
+ while (old_blkend >= old_break)
+ {
+ char new_vmbuf[BLCKSZ];
+ char *new_cur = new_vmbuf;
+ bool empty = true;
+ bool old_lastpart;
+
+ /* Copy page header in advance */
+ memcpy(new_vmbuf, &pageheader, SizeOfPageHeaderData);
+
+ /* Rewrite the last part of the old page? */
+ old_lastpart = old_lastblk && (old_blkend == old_break);
+
+ new_cur += SizeOfPageHeaderData;
+
+ /* Process old page bytes one by one, and turn them into the new page. */
+ while (old_break > old_cur)
+ {
+ uint16 new_vmbits = 0;
+ int i;
+
+ /* Generate new format bits while keeping old information */
+ for (i = 0; i < BITS_PER_BYTE; i++)
+ {
+ uint8 byte = * (uint8 *) old_cur;
+
+ if (byte & (1 << (BITS_PER_HEAPBLOCK_OLD * i)))
+ {
+ empty = false;
+ new_vmbits |= 1 << (BITS_PER_HEAPBLOCK * i);
+ }
+ }
+
+ /* Copy new visibility map bit to new format page */
+ memcpy(new_cur, &new_vmbits, BITS_PER_HEAPBLOCK);
+
+ old_cur += BITS_PER_HEAPBLOCK_OLD;
+ new_cur += BITS_PER_HEAPBLOCK;
+ }
+
+ /* If the last part of the old page is empty, skip writing it */
+ if (old_lastpart && empty)
+ break;
+
+ /* Set new checksum for a visibility map page (if enabled) */
+ if (old_cluster.controldata.data_checksum_version != 0 &&
+ new_cluster.controldata.data_checksum_version != 0)
+ ((PageHeader) new_vmbuf)->pd_checksum =
+ pg_checksum_page(new_vmbuf, new_blkno);
+
+ if (write(dst_fd, new_vmbuf, BLCKSZ) != BLCKSZ)
+ {
+ close(dst_fd);
+ close(src_fd);
+ return getErrorText();
+ }
+
+ old_break += rewriteVmBytesPerPage;
+ new_blkno++;
+ }
+ }
+
+ /* Close files */
+ close(dst_fd);
+ close(src_fd);
+
+ return NULL;
+
+}
+
void
check_hard_link(void)
{
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 6122878..89beb73 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -110,6 +110,10 @@ extern char *output_files[];
#define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
/*
+ * The format of the visibility map changed with this 9.6 commit.
+ */
+#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201603011
+/*
* pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
* ("Improve concurrency of foreign key locking") which also updated catalog
* version to this value. pg_upgrade behavior depends on whether old and new
@@ -365,6 +369,8 @@ bool pid_lock_file_exists(const char *datadir);
const char *copyFile(const char *src, const char *dst, bool force);
const char *linkFile(const char *src, const char *dst);
+const char *rewriteVisibilityMap(const char *fromfile, const char *tofile,
+ bool force);
void check_hard_link(void);
FILE *fopen_priv(const char *path, const char *mode);
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index b20f073..0c1a822 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -11,12 +11,13 @@
#include "pg_upgrade.h"
+#include <sys/stat.h>
#include "catalog/pg_class.h"
#include "access/transam.h"
static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
-static void transfer_relfile(FileNameMap *map, const char *suffix);
+static void transfer_relfile(FileNameMap *map, const char *suffix, bool vm_must_add_frozenbit);
/*
@@ -132,6 +133,7 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
{
int mapnum;
bool vm_crashsafe_match = true;
+ bool vm_must_add_frozenbit = false;
/*
* Do the old and new cluster disagree on the crash-safetiness of the vm
@@ -141,13 +143,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
vm_crashsafe_match = false;
+ /*
+ * Do we need to rewrite visibilitymap?
+ */
+ if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_FROZEN_BIT_CAT_VER &&
+ new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
+ vm_must_add_frozenbit = true;
+
for (mapnum = 0; mapnum < size; mapnum++)
{
if (old_tablespace == NULL ||
strcmp(maps[mapnum].old_tablespace, old_tablespace) == 0)
{
/* transfer primary file */
- transfer_relfile(&maps[mapnum], "");
+ transfer_relfile(&maps[mapnum], "", vm_must_add_frozenbit);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
@@ -155,9 +164,9 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
/*
* Copy/link any fsm and vm files, if they exist
*/
- transfer_relfile(&maps[mapnum], "_fsm");
+ transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit);
if (vm_crashsafe_match)
- transfer_relfile(&maps[mapnum], "_vm");
+ transfer_relfile(&maps[mapnum], "_vm", vm_must_add_frozenbit);
}
}
}
@@ -167,17 +176,19 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
/*
* transfer_relfile()
*
- * Copy or link file from old cluster to new one.
+ * Copy or link file from old cluster to new one. If vm_must_add_frozenbit
+ * is true, visibility map forks are converted and rewritten, even in link
+ * mode.
*/
static void
-transfer_relfile(FileNameMap *map, const char *type_suffix)
+transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit)
{
const char *msg;
char old_file[MAXPGPATH];
char new_file[MAXPGPATH];
- int fd;
int segno;
char extent_suffix[65];
+ struct stat statbuf;
/*
* Now copy/link any related segments as well. Remember, PG breaks large
@@ -210,7 +221,7 @@ transfer_relfile(FileNameMap *map, const char *type_suffix)
if (type_suffix[0] != '\0' || segno != 0)
{
/* Did file open fail? */
- if ((fd = open(old_file, O_RDONLY, 0)) == -1)
+ if (stat(old_file, &statbuf) != 0)
{
/* File does not exist? That's OK, just return */
if (errno == ENOENT)
@@ -220,7 +231,10 @@ transfer_relfile(FileNameMap *map, const char *type_suffix)
map->nspname, map->relname, old_file, new_file,
getErrorText());
}
- close(fd);
+
+ /* If file is empty, just return */
+ if (statbuf.st_size == 0)
+ return;
}
unlink(new_file);
@@ -232,7 +246,13 @@ transfer_relfile(FileNameMap *map, const char *type_suffix)
{
pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = copyFile(old_file, new_file, true)) != NULL)
+ /* Rewrite visibility map if needed */
+ if (vm_must_add_frozenbit && (strcmp(type_suffix, "_vm") == 0))
+ msg = rewriteVisibilityMap(old_file, new_file, true);
+ else
+ msg = copyFile(old_file, new_file, true);
+
+ if (msg)
pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
@@ -240,7 +260,13 @@ transfer_relfile(FileNameMap *map, const char *type_suffix)
{
pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkFile(old_file, new_file)) != NULL)
+ /* Rewrite visibility map if needed */
+ if (vm_must_add_frozenbit && (strcmp(type_suffix, "_vm") == 0))
+ msg = rewriteVisibilityMap(old_file, new_file, true);
+ else
+ msg = linkFile(old_file, new_file);
+
+ if (msg)
pg_fatal("error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
map->nspname, map->relname, old_file, new_file, msg);
}
On Thu, Mar 10, 2016 at 10:47 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thanks. I adopted some of your suggestions, rejected others, fixed a
few minor things that I missed previously, and committed this. If you
think any of the changes that I rejected still have merit, please
resubmit those changes as separate patches.

Thank you for your effort on this feature and for committing it.
I feel I couldn't do good work on this feature at the final stage,
but I really appreciate all your advice and suggestions.
Don't feel bad, you put a lot of work on this, and if you were getting
a little tired towards the end, that's very understandable. This
extremely important feature was largely driven by you, and that's a
big accomplishment.
I got your point.
The attached latest patch can skip writing the last part of the last old page
if it's empty.
Please review it.
Committed.
Which I think just about brings us to the end of this epic journey,
except for any cleanup of what's already been committed that needs to
be done. Thanks so much for your hard work!
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Mar 12, 2016 at 2:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Mar 10, 2016 at 10:47 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thanks. I adopted some of your suggestions, rejected others, fixed a
few minor things that I missed previously, and committed this. If you
think any of the changes that I rejected still have merit, please
resubmit those changes as separate patches.

Thank you for your effort on this feature and for committing it.
I feel I couldn't do good work on this feature at the final stage,
but I really appreciate all your advice and suggestions.

Don't feel bad, you put a lot of work on this, and if you were getting
a little tired towards the end, that's very understandable. This
extremely important feature was largely driven by you, and that's a
big accomplishment.

I got your point.
The attached latest patch can skip writing the last part of the last old page
if it's empty.
Please review it.

Committed.
Which I think just about brings us to the end of this epic journey,
except for any cleanup of what's already been committed that needs to
be done. Thanks so much for your hard work!
Thank you so much!
What I wanted to deal with in this thread is almost done. I'm going to
test the feature more for the 9.6 release.
Regards,
--
Masahiko Sawada
On 03/11/2016 09:48 AM, Masahiko Sawada wrote:
Thank you so much!
What I wanted to deal with in this thread is almost done. I'm going to
test the feature more for the 9.6 release.
Nicely done!
Regards,
--
Masahiko Sawada
--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.