CLUSTER not lose indexes
Hackers:
I've modified commands/cluster.c so that it recreates the indexes on the
table after clustering the table. I attach the patch.
There are (of course) things I don't understand. For example, whether
(or when) I should use CommandCounterIncrement() after each
index_create, or if I should call setRelhasindex() only once (and not
once per index); or whether I need to acquire some lock on the indexes.
I tested it with one table and several indexes. Truth is I don't know
how to test for concurrency, or if it's worth the trouble.
The purpose of this experiment (and, I hope, more to follow) is to
familiarize myself with the guts of PostgreSQL, so I can work on my CS
thesis with it. If you can point out my misconceptions, I'd be happy to
try again (and again, and...)
Thank you.
--
Alvaro Herrera (<alvherre[a]atentus.com>)
"La Primavera ha venido. Nadie sabe como ha sido" (A. Machado)
Attachments:
cluster.patch (text/plain; charset=US-ASCII)
Index: cluster.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/commands/cluster.c,v
retrieving revision 1.82
diff -c -c -r1.82 cluster.c
*** cluster.c 2002/06/20 20:29:26 1.82
--- cluster.c 2002/07/05 02:33:40
***************
*** 26,69 ****
#include "access/heapam.h"
#include "catalog/heap.h"
#include "catalog/index.h"
#include "catalog/pg_index.h"
#include "catalog/pg_proc.h"
#include "commands/cluster.h"
#include "commands/tablecmds.h"
#include "miscadmin.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/syscache.h"
static Oid copy_heap(Oid OIDOldHeap, const char *NewName);
- static Oid copy_index(Oid OIDOldIndex, Oid OIDNewHeap,
- const char *NewIndexName);
static void rebuildheap(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex);
/*
* cluster
*
* STILL TO DO:
! * Create a list of all the other indexes on this relation. Because
! * the cluster will wreck all the tids, I'll need to destroy bogus
! * indexes. The user will have to re-create them. Not nice, but
! * I'm not a nice guy. The alternative is to try some kind of post
! * destroy re-build. This may be possible. I'll check out what the
! * index create functiond want in the way of paramaters. On the other
! * hand, re-creating n indexes may blow out the space.
*/
void
cluster(RangeVar *oldrelation, char *oldindexname)
{
Oid OIDOldHeap,
OIDOldIndex,
! OIDNewHeap,
! OIDNewIndex;
Relation OldHeap,
! OldIndex;
char NewHeapName[NAMEDATALEN];
! char NewIndexName[NAMEDATALEN];
/*
* We grab exclusive access to the target rel and index for the
--- 26,82 ----
#include "access/heapam.h"
#include "catalog/heap.h"
#include "catalog/index.h"
+ #include "catalog/catname.h"
#include "catalog/pg_index.h"
#include "catalog/pg_proc.h"
#include "commands/cluster.h"
#include "commands/tablecmds.h"
#include "miscadmin.h"
#include "utils/builtins.h"
+ #include "utils/fmgroids.h"
#include "utils/lsyscache.h"
#include "utils/syscache.h"
static Oid copy_heap(Oid OIDOldHeap, const char *NewName);
static void rebuildheap(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex);
/*
+ * We need one of these structs for each index in the relation to be
+ * clustered.
+ */
+ typedef struct
+ {
+ char *indexName;
+ IndexInfo *indexInfo;
+ Oid accessMethodOID;
+ Oid *classOID;
+ Oid indexOID;
+ bool isPrimary;
+ } IndexAttrs;
+
+ /*
* cluster
*
* STILL TO DO:
! * Keep foreign keys of the clustered table.
*/
void
cluster(RangeVar *oldrelation, char *oldindexname)
{
Oid OIDOldHeap,
OIDOldIndex,
! OIDNewHeap;
Relation OldHeap,
! OldIndex,
! indexRelation;
char NewHeapName[NAMEDATALEN];
! HeapTuple indexTuple;
! ScanKeyData entry;
! HeapScanDesc scan;
! List *indexes=NIL;
! List *elem;
! IndexAttrs *attrs;
/*
* We grab exclusive access to the target rel and index for the
***************
*** 95,100 ****
--- 108,146 ----
index_close(OldIndex);
/*
+ * Save the information of all indexes on the relation.
+ */
+ indexRelation = heap_openr(IndexRelationName, AccessShareLock);
+ ScanKeyEntryInitialize(&entry, 0, Anum_pg_index_indrelid,
+ F_OIDEQ, ObjectIdGetDatum(OIDOldHeap));
+ scan = heap_beginscan(indexRelation, SnapshotNow, 1, &entry);
+ while ((indexTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_index index = (Form_pg_index) GETSTRUCT(indexTuple);
+ HeapTuple tuple;
+ Form_pg_class class;
+ attrs = (IndexAttrs *) palloc(sizeof(IndexAttrs));
+ attrs->indexInfo = BuildIndexInfo(index);
+ attrs->isPrimary = index->indisprimary;
+ attrs->indexOID = index->indexrelid;
+ attrs->classOID = (Oid *)palloc(sizeof(Oid) *
+ attrs->indexInfo->ii_NumIndexAttrs);
+ memcpy(attrs->classOID, index->indclass,
+ sizeof(Oid) * attrs->indexInfo->ii_NumIndexAttrs);
+ tuple = SearchSysCache(RELOID, ObjectIdGetDatum(attrs->indexOID),
+ 0, 0, 0);
+ if (!HeapTupleIsValid(tuple))
+ break;
+ class = (Form_pg_class) GETSTRUCT(tuple);
+ attrs->indexName = pstrdup(NameStr(class->relname));
+ attrs->accessMethodOID = class->relam;
+ ReleaseSysCache(tuple);
+ indexes=lcons((void *)attrs, indexes);
+ }
+ heap_endscan(scan);
+ heap_close(indexRelation, AccessShareLock);
+
+ /*
* Create the new heap with a temporary name.
*/
snprintf(NewHeapName, NAMEDATALEN, "temp_%u", OIDOldHeap);
***************
*** 111,123 ****
/* To make the new heap's data visible. */
CommandCounterIncrement();
- /* Create new index over the tuples of the new heap. */
- snprintf(NewIndexName, NAMEDATALEN, "temp_%u", OIDOldIndex);
-
- OIDNewIndex = copy_index(OIDOldIndex, OIDNewHeap, NewIndexName);
-
- CommandCounterIncrement();
-
/* Destroy old heap (along with its index) and rename new. */
heap_drop_with_catalog(OIDOldHeap, allowSystemTableMods);
--- 157,162 ----
***************
*** 128,134 ****
/* This one might be unnecessary, but let's be safe. */
CommandCounterIncrement();
! renamerel(OIDNewIndex, oldindexname);
}
static Oid
--- 167,180 ----
/* This one might be unnecessary, but let's be safe. */
CommandCounterIncrement();
! foreach (elem, indexes)
! {
! attrs=(IndexAttrs *) lfirst(elem);
! index_create(OIDNewHeap, attrs->indexName, attrs->indexInfo,
! attrs->accessMethodOID, attrs->classOID, attrs->isPrimary,
! allowSystemTableMods);
! setRelhasindex(OIDNewHeap, true, attrs->isPrimary, InvalidOid);
! }
}
static Oid
***************
*** 173,214 ****
return OIDNewHeap;
}
-
- static Oid
- copy_index(Oid OIDOldIndex, Oid OIDNewHeap, const char *NewIndexName)
- {
- Oid OIDNewIndex;
- Relation OldIndex,
- NewHeap;
- IndexInfo *indexInfo;
-
- NewHeap = heap_open(OIDNewHeap, AccessExclusiveLock);
- OldIndex = index_open(OIDOldIndex);
-
- /*
- * Create a new index like the old one. To do this I get the info
- * from pg_index, and add a new index with a temporary name (that will
- * be changed later).
- */
- indexInfo = BuildIndexInfo(OldIndex->rd_index);
-
- OIDNewIndex = index_create(OIDNewHeap,
- NewIndexName,
- indexInfo,
- OldIndex->rd_rel->relam,
- OldIndex->rd_index->indclass,
- OldIndex->rd_index->indisprimary,
- allowSystemTableMods);
-
- setRelhasindex(OIDNewHeap, true,
- OldIndex->rd_index->indisprimary, InvalidOid);
-
- index_close(OldIndex);
- heap_close(NewHeap, NoLock);
-
- return OIDNewIndex;
- }
-
static void
rebuildheap(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex)
--- 219,224 ----
Alvaro Herrera <alvherre@atentus.com> writes:
There are (of course) things I don't understand. For example, whether
(or when) I should use CommandCounterIncrement() after each
index_create, or if I should call setRelhasindex() only once (and not
once per index); or whether I need to acquire some lock on the indexes.
I think you probably want a CommandCounterIncrement at the bottom of the
loop (after setRelhasindex). If it works as-is it's just by chance,
ie due to internal CCI calls in index_create.
Locking newly-created indexes is not really necessary, since no one else
can see them until you commit anyhow.
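For instance, a sketch only (using the same calls your patch already
makes, not tested code), the recreation loop would look about like this
with the CCI moved to the bottom of each iteration:
foreach(elem, indexes)
{
	IndexAttrs *attrs = (IndexAttrs *) lfirst(elem);

	index_create(OIDNewHeap, attrs->indexName, attrs->indexInfo,
				 attrs->accessMethodOID, attrs->classOID,
				 attrs->isPrimary, allowSystemTableMods);
	setRelhasindex(OIDNewHeap, true, attrs->isPrimary, InvalidOid);

	/* make this iteration's catalog changes visible to the next one */
	CommandCounterIncrement();
}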
+ tuple = SearchSysCache(RELOID, ObjectIdGetDatum(attrs->indexOID),
+ 0, 0, 0);
+ if (!HeapTupleIsValid(tuple))
+ break;
Breaking out of the loop hardly seems an appropriate response to this
failure condition. Not finding the index' pg_class entry is definitely
an error.
I'd also suggest more-liberal commenting, as well as more attention to
updating the existing comments to match new reality.
In general, I'm not thrilled about expending more code on the existing
fundamentally-broken implementation of CLUSTER. We need to look at
making use of the ability to write a new version of a table (or index)
under a new relfilenode value, without changing the table's OID.
However, some parts of your patch will probably still be needed when
someone gets around to making that happen, so I won't object for now.
regards, tom lane
Alvaro Herrera wrote:
Hackers:
I've modified commands/cluster.c so that it recreates the indexes on the
table after clustering the table. I attach the patch.
There are (of course) things I don't understand. For example, whether
(or when) I should use CommandCounterIncrement() after each
index_create, or if I should call setRelhasindex() only once (and not
once per index); or whether I need to acquire some lock on the indexes.
I tested it with one table and several indexes. Truth is I don't know
how to test for concurrency, or if it's worth the trouble.
The purpose of this experiment (and, I hope, more to follow) is to
familiarize myself with the guts of PostgreSQL, so I can work on my CS
thesis with it. If you can point me my misconceptions I'd be happy to
try again (and again, and...)
I think Tom was suggesting that you may want to continue work on CLUSTER
and make use of relfilenode. After the cluster, you can just update
pg_class.relfilenode with the new file name (random oid generated at
build time) and as soon as you commit, all backends will start using the
new file and you can delete the old one.
The particular case we would like to improve is this:
/* Destroy old heap (along with its index) and rename new. */
heap_drop_with_catalog(OIDOldHeap, allowSystemTableMods);
CommandCounterIncrement();
renamerel(OIDNewHeap, oldrelation->relname);
In this code, we delete the old relation, then rename the new one. It
would be good to have this all happen in one update of
pg_class.relfilenode; that way it is an atomic operation.
So, create a heap (in the temp namespace so it is deleted on crash),
copy the old heap into the new file in cluster order, and when you are
done, point the old pg_class relfilenode at the new clustered heap
filename, then point the new cluster heap pg_class at the old heap file,
and then drop the cluster heap file; that will remove the _old_ file
(I believe on commit) and you are ready to go.
So, you are basically creating a new heap, but at finish, the new heap's
pg_class and the old heap's file go away. I thought about doing it
without creating the pg_class entry for the new heap, but the code
really wants to have a heap it can manipulate.
Same with index rebuilding, I think. The indexes are built on the new
heap, and the relfilenode swap works just the same.
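Roughly, the sequence I have in mind looks like this (a sketch only;
swap_relfilenodes is a hypothetical helper name, the other calls are the
ones cluster.c already uses):
/* build the clustered copy under a temporary pg_class entry */
OIDNewHeap = copy_heap(OIDOldHeap, NewHeapName);
rebuildheap(OIDNewHeap, OIDOldHeap, OIDOldIndex);
CommandCounterIncrement();

/* point the old pg_class row at the new clustered file, and the
 * temporary row at the old file (hypothetical helper) */
swap_relfilenodes(OIDNewHeap, OIDOldHeap);
CommandCounterIncrement();

/* dropping the temporary entry now carries the old, unclustered
 * file away with it */
heap_drop_with_catalog(OIDNewHeap, allowSystemTableMods);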
Let us know if you want to pursue that and we can give you additional
assistance. I would like to see CLUSTER really polished up. More
people are using it now and it really needs someone to focus on it.
Glad to see you working on CLUSTER. Welcome aboard.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote:
Alvaro Herrera wrote:
Hackers:
I've modified commands/cluster.c so that it recreates the indexes on the
table after clustering the table. I attach the patch.
I think Tom was suggesting that you may want to continue work on CLUSTER
and make use of relfilenode. After the cluster, you can just update
pg_class.relfilenode with the new file name (random oid generated at
build time) and as soon as you commit, all backends will start using the
new file and you can delete the old one.
I'm looking at pg_class, and relfilenode is equal to oid in all cases
AFAICS. If I create a new, "random" relfilenode, the equality will not
hold. Is that OK? Also, is the new relfilenode somehow guaranteed to
not be assigned to another relation (pg_class tuple, I think)?
In this code, we delete the old relation, then rename the new one. It
would be good to have this all happen in one update of
pg_class.relfilenode; that way it is an atomic operation.
OK, I'll see if I can do that.
Let us know if you want to pursue that and we can give you additional
assistance. I would like to see CLUSTER really polished up. More
people are using it now and it really needs someone to focus on it.
I thought CLUSTER was meant to be replaced by automatically clustering
indexes. Isn't that so? Is anyone working on that?
Glad to see you working on CLUSTER. Welcome aboard.
Thank you. I'm really happy to be able to work in Postgres.
--
Alvaro Herrera (<alvherre[a]atentus.com>)
"Limitate a mirar... y algun dia veras"
Alvaro Herrera wrote:
Bruce Momjian wrote:
Alvaro Herrera wrote:
Hackers:
I've modified commands/cluster.c so that it recreates the indexes on the
table after clustering the table. I attach the patch.
I think Tom was suggesting that you may want to continue work on CLUSTER
and make use of relfilenode. After the cluster, you can just update
pg_class.relfilenode with the new file name (random oid generated at
build time) and as soon as you commit, all backends will start using the
new file and you can delete the old one.
I'm looking at pg_class, and relfilenode is equal to oid in all cases
AFAICS. If I create a new, "random" relfilenode, the equality will not
hold. Is that OK? Also, is the new relfilenode somehow guaranteed to
Yes, they are all the same only because we haven't gotten CLUSTER fixed
yet. ;-) We knew we needed to make cases where they are not the same
specifically for cases like CLUSTER.
not be assigned to another relation (pg_class tuple, I think)?
It will be fine. The relfilenode will be the one assigned to the new
heap table and will not be used by others.
In this code, we delete the old relation, then rename the new one. It
would be good to have this all happen in one update of
pg_class.relfilenode; that way it is an atomic operation.
OK, I'll see if I can do that.
Let us know if you want to pursue that and we can give you additional
assistance. I would like to see CLUSTER really polished up. More
people are using it now and it really needs someone to focus on it.
I thought CLUSTER was meant to be replaced by automatically clustering
indexes. Isn't that so? Is anyone working on that?
We have no idea how to automatically cluster based on an index. Not
even sure how the commercial guys do it.
Glad to see you working on CLUSTER. Welcome aboard.
Thank you. I'm really happy to be able to work in Postgres.
Great.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Tom Lane wrote:
I think you probably want a CommandCounterIncrement at the bottom of the
loop (after setRelhasindex). If it works as-is it's just by chance,
ie due to internal CCI calls in index_create.
Done.
+ tuple = SearchSysCache(RELOID, ObjectIdGetDatum(attrs->indexOID),
+ 0, 0, 0);
+ if (!HeapTupleIsValid(tuple))
+ break;
Breaking out of the loop hardly seems an appropriate response to this
failure condition. Not finding the index' pg_class entry is definitely
an error.
Sure. elog(ERROR) now. I'm not sure what I was thinking when I wrote
that.
I'd also suggest more-liberal commenting, as well as more attention to
updating the existing comments to match new reality.
I'm afraid I cannot get too verbose no matter how hard I try. I hope
this one is OK.
In general, I'm not thrilled about expending more code on the existing
fundamentally-broken implementation of CLUSTER. We need to look at
making use of the ability to write a new version of a table (or index)
under a new relfilenode value, without changing the table's OID.
However, some parts of your patch will probably still be needed when
someone gets around to making that happen, so I won't object for now.
Will try to do this.
--
Alvaro Herrera (<alvherre[a]atentus.com>)
Licensee shall have no right to use the Licensed Software
for productive or commercial use. (Licencia de StarOffice 6.0 beta)
Attachments:
cluster.patch (text/plain; charset=US-ASCII)
Index: cluster.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/commands/cluster.c,v
retrieving revision 1.82
diff -c -u -r1.82 cluster.c
--- cluster.c 2002/06/20 20:29:26 1.82
+++ cluster.c 2002/07/05 06:25:17
@@ -26,44 +26,62 @@
#include "access/heapam.h"
#include "catalog/heap.h"
#include "catalog/index.h"
+#include "catalog/catname.h"
#include "catalog/pg_index.h"
#include "catalog/pg_proc.h"
#include "commands/cluster.h"
#include "commands/tablecmds.h"
#include "miscadmin.h"
#include "utils/builtins.h"
+#include "utils/fmgroids.h"
#include "utils/lsyscache.h"
#include "utils/syscache.h"
static Oid copy_heap(Oid OIDOldHeap, const char *NewName);
-static Oid copy_index(Oid OIDOldIndex, Oid OIDNewHeap,
- const char *NewIndexName);
static void rebuildheap(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex);
/*
+ * We need one of these structs for each index in the relation to be
+ * clustered. It's basically the data needed by index_create() so
+ * we can recreate the indexes after destroying the old heap.
+ */
+typedef struct
+{
+ char *indexName;
+ IndexInfo *indexInfo;
+ Oid accessMethodOID;
+ Oid *classOID;
+ Oid indexOID;
+ bool isPrimary;
+} IndexAttrs;
+
+/*
* cluster
*
* STILL TO DO:
- * Create a list of all the other indexes on this relation. Because
- * the cluster will wreck all the tids, I'll need to destroy bogus
- * indexes. The user will have to re-create them. Not nice, but
- * I'm not a nice guy. The alternative is to try some kind of post
- * destroy re-build. This may be possible. I'll check out what the
- * index create functiond want in the way of paramaters. On the other
- * hand, re-creating n indexes may blow out the space.
+ * Keep foreign keys, permissions and inheritance of the clustered table.
+ *
+ * We need to look at making use of the ability to write a new version of a
+ * table (or index) under a new relfilenode value, without changing the
+ * table's OID.
*/
void
cluster(RangeVar *oldrelation, char *oldindexname)
{
Oid OIDOldHeap,
OIDOldIndex,
- OIDNewHeap,
- OIDNewIndex;
+ OIDNewHeap;
Relation OldHeap,
- OldIndex;
+ OldIndex,
+ indexRelation;
char NewHeapName[NAMEDATALEN];
- char NewIndexName[NAMEDATALEN];
+ HeapTuple indexTuple;
+ ScanKeyData entry;
+ HeapScanDesc scan;
+ List *indexes=NIL;
+ List *elem;
+ IndexAttrs *attrs;
/*
* We grab exclusive access to the target rel and index for the
@@ -95,6 +113,53 @@
index_close(OldIndex);
/*
+ * Save the information of all indexes on the relation.
+ * Get the info from IndexRelationName.
+ */
+ indexRelation = heap_openr(IndexRelationName, AccessShareLock);
+ ScanKeyEntryInitialize(&entry, 0, Anum_pg_index_indrelid,
+ F_OIDEQ, ObjectIdGetDatum(OIDOldHeap));
+ scan = heap_beginscan(indexRelation, SnapshotNow, 1, &entry);
+ while ((indexTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_index index = (Form_pg_index) GETSTRUCT(indexTuple);
+ HeapTuple tuple;
+ Form_pg_class class;
+
+ attrs = (IndexAttrs *) palloc(sizeof(IndexAttrs));
+ attrs->indexInfo = BuildIndexInfo(index);
+ attrs->isPrimary = index->indisprimary;
+ attrs->indexOID = index->indexrelid;
+
+ /* The opclasses are copied verbatim from the original indexes.
+ */
+ attrs->classOID = (Oid *)palloc(sizeof(Oid) *
+ attrs->indexInfo->ii_NumIndexAttrs);
+ memcpy(attrs->classOID, index->indclass,
+ sizeof(Oid) * attrs->indexInfo->ii_NumIndexAttrs);
+
+ /* Name and access method of each index come from
+ * RelationRelationName.
+ */
+ tuple = SearchSysCache(RELOID, ObjectIdGetDatum(attrs->indexOID),
+ 0, 0, 0);
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "CLUSTER: cannot find index %u for table %s",
+ attrs->indexOID, oldrelation->relname);
+ class = (Form_pg_class) GETSTRUCT(tuple);
+ attrs->indexName = pstrdup(NameStr(class->relname));
+ attrs->accessMethodOID = class->relam;
+ ReleaseSysCache(tuple);
+
+ /* Cons the data to the index list. We don't care about ordering,
+ * and this is more efficient.
+ */
+ indexes=lcons((void *)attrs, indexes);
+ }
+ heap_endscan(scan);
+ heap_close(indexRelation, AccessShareLock);
+
+ /*
* Create the new heap with a temporary name.
*/
snprintf(NewHeapName, NAMEDATALEN, "temp_%u", OIDOldHeap);
@@ -111,13 +176,6 @@
/* To make the new heap's data visible. */
CommandCounterIncrement();
- /* Create new index over the tuples of the new heap. */
- snprintf(NewIndexName, NAMEDATALEN, "temp_%u", OIDOldIndex);
-
- OIDNewIndex = copy_index(OIDOldIndex, OIDNewHeap, NewIndexName);
-
- CommandCounterIncrement();
-
/* Destroy old heap (along with its index) and rename new. */
heap_drop_with_catalog(OIDOldHeap, allowSystemTableMods);
@@ -128,9 +186,21 @@
/* This one might be unnecessary, but let's be safe. */
CommandCounterIncrement();
- renamerel(OIDNewIndex, oldindexname);
+ /* Recreate the indexes on the relation.
+ */
+ foreach (elem, indexes)
+ {
+ attrs=(IndexAttrs *) lfirst(elem);
+ index_create(OIDNewHeap, attrs->indexName, attrs->indexInfo,
+ attrs->accessMethodOID, attrs->classOID, attrs->isPrimary,
+ allowSystemTableMods);
+ setRelhasindex(OIDNewHeap, true, attrs->isPrimary, InvalidOid);
+ CommandCounterIncrement();
+ }
}
+/* Create and initialize the new heap
+ */
static Oid
copy_heap(Oid OIDOldHeap, const char *NewName)
{
@@ -174,42 +244,8 @@
return OIDNewHeap;
}
-static Oid
-copy_index(Oid OIDOldIndex, Oid OIDNewHeap, const char *NewIndexName)
-{
- Oid OIDNewIndex;
- Relation OldIndex,
- NewHeap;
- IndexInfo *indexInfo;
-
- NewHeap = heap_open(OIDNewHeap, AccessExclusiveLock);
- OldIndex = index_open(OIDOldIndex);
-
- /*
- * Create a new index like the old one. To do this I get the info
- * from pg_index, and add a new index with a temporary name (that will
- * be changed later).
- */
- indexInfo = BuildIndexInfo(OldIndex->rd_index);
-
- OIDNewIndex = index_create(OIDNewHeap,
- NewIndexName,
- indexInfo,
- OldIndex->rd_rel->relam,
- OldIndex->rd_index->indclass,
- OldIndex->rd_index->indisprimary,
- allowSystemTableMods);
-
- setRelhasindex(OIDNewHeap, true,
- OldIndex->rd_index->indisprimary, InvalidOid);
-
- index_close(OldIndex);
- heap_close(NewHeap, NoLock);
-
- return OIDNewIndex;
-}
-
-
+/* Load the data into the new heap, clustered.
+ */
static void
rebuildheap(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex)
{
Alvaro Herrera <alvherre@atentus.com> writes:
I'm looking at pg_class, and relfilenode is equal to oid in all cases
AFAICS. If I create a new, "random" relfilenode, the equality will not
hold. Is that OK?
The idea is that OID is the logical table identifier and relfilenode is
the physical identifier (so relfilenode is what should be used in bufmgr
and below). There are undoubtedly a few places that get this wrong, but
we won't be able to ferret them out until someone starts to exercise
the facility.
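For reference, the physical identifier that bufmgr and below pass around
is just a pair of OIDs (quoting from memory here, so check
include/storage/relfilenode.h rather than trusting this):
typedef struct RelFileNode
{
	Oid			tblNode;		/* database containing the relation */
	Oid			relNode;		/* pg_class.relfilenode of the relation */
} RelFileNode;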
Also, is the new relfilenode somehow guaranteed to
not be assigned to another relation (pg_class tuple, I think)?
I've been wondering about that myself. We might have to add a unique
index on pg_class.relfilenode to ensure this; otherwise, after OID
wraparound there would be no guarantees.
In this code, we delete the old relation, then rename the new one. It
would be good to have this all happen in one update of
pg_class.relfilenode; that way it is an atomic operation.
As long as you have not committed, it's atomic anyway because no one can
see your updates. It'd be nice to do it in one update for efficiency,
but don't contort the code beyond reason to achieve that.
regards, tom lane
Tom Lane wrote:
Also, is the new relfilenode somehow guaranteed to
not be assigned to another relation (pg_class tuple, I think)?
I've been wondering about that myself. We might have to add a unique
index on pg_class.relfilenode to ensure this; otherwise, after OID
wraparound there would be no guarantees.
Yep, good point.
In this code, we delete the old relation, then rename the new one. It
would be good to have this all happen in one update of
pg_class.relfilenode; that way it is an atomic operation.
As long as you have not committed, it's atomic anyway because no one can
see your updates. It'd be nice to do it in one update for efficiency,
but don't contort the code beyond reason to achieve that.
Sorry, I meant to say that we added relfilenode for exactly this case,
so that we have atomic file access semantics. Do we already have that
feature in the current code?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Alvaro Herrera wrote:
In general, I'm not thrilled about expending more code on the existing
fundamentally-broken implementation of CLUSTER. We need to look at
making use of the ability to write a new version of a table (or index)
under a new relfilenode value, without changing the table's OID.
However, some parts of your patch will probably still be needed when
someone gets around to making that happen, so I won't object for now.
OK, I remember now. In renamerel(), the new clustered file is renamed
to the old table name. However, it has a new oid. The idea of
relfilenode was that we want to cluster the table and keep the same
relation oid. That way, other system tables that reference that OID
will still work.
Seems like renamerel will have to stay because it is used by ALTER TABLE
RENAME, so we just need some new code that updates the relfilenode of
the old pg_class row to point to the new clustered file. Swapping
relfilenodes between the old and new pg_class rows and deleting the new
table should do the trick of deleting the non-clustered file and the
temp pg_class row at the same time.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Seems like renamerel will have to stay because it is used by ALTER TABLE
RENAME, so we just need some new code that updates the relfilenode of
the old pg_class row to point to the new clustered file. Swapping
relfilenodes between the old and new pg_class rows and deleting the new
table should do the trick of deleting the non-clustered file and the
temp pg_class row at the same time.
I think you're still letting your thinking be contorted by the existing
CLUSTER implementation. Do we need a temp pg_class entry at all? Seems
like we just want to UPDATE the pg_class row with the new relfilenode
value; then we can see the update but no one else can (till we commit).
Ditto for the indexes.
What's still a little unclear to me is how to access the old heap and
index files to read the data while simultaneously accessing the new ones
to write it. Much of the existing code likes to have a Relation struct
available, not only a RelFileNode, so it may be necessary to have both
old and new Relations present at the same time. If that's the case we
might be stuck with making a temp pg_class entry just to support a phony
Relation :-(
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Seems like renamerel will have to stay because it is used by ALTER TABLE
RENAME, so we just need some new code that updates the relfilenode of
the old pg_class row to point to the new clustered file. Swapping
relfilenodes between the old and new pg_class rows and deleting the new
table should do the trick of deleting the non-clustered file and the
temp pg_class row at the same time.
I think you're still letting your thinking be contorted by the existing
CLUSTER implementation. Do we need a temp pg_class entry at all? Seems
like we just want to UPDATE the pg_class row with the new relfilenode
value; then we can see the update but no one else can (till we commit).
Ditto for the indexes.
What's still a little unclear to me is how to access the old heap and
index files to read the data while simultaneously accessing the new ones
to write it. Much of the existing code likes to have a Relation struct
available, not only a RelFileNode, so it may be necessary to have both
old and new Relations present at the same time. If that's the case we
might be stuck with making a temp pg_class entry just to support a phony
Relation :-(
Yes, that was my conclusion, that we need the temp heap so we can access
it in a clean manner. Sure, it would be nice if we could access a file
on its own, but it doesn't seem worth the complexity required to
accomplish it.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Seems like renamerel will have to stay because it is used by ALTER TABLE
RENAME, so we just need some new code that updates the relfilenode of
the old pg_class row to point to the new clustered file. Swapping
relfilenodes between the old and new pg_class rows and deleting the new
table should do the trick of deleting the non-clustered file and the
temp pg_class row at the same time.
I think you're still letting your thinking be contorted by the existing
CLUSTER implementation. Do we need a temp pg_class entry at all? Seems
like we just want to UPDATE the pg_class row with the new relfilenode
value; then we can see the update but no one else can (till we commit).
Ditto for the indexes.
That's what I originally thought: mess around directly with the smgr (or
with some upper layer? I don't know) to create a new relfilenode, and
then attach it to the heap. I don't know if it's possible or too
difficult.
Then, with Bruce's explanation, I thought I should just create a temp
table and exchange relfilenodes, which is much simpler.
--
Alvaro Herrera (<alvherre[a]atentus.com>)
"El dia que dejes de cambiar dejaras de vivir"
Alvaro Herrera wrote:
I think you're still letting your thinking be contorted by the existing
CLUSTER implementation. Do we need a temp pg_class entry at all? Seems
like we just want to UPDATE the pg_class row with the new relfilenode
value; then we can see the update but no one else can (till we commit).
Ditto for the indexes.
That's what I originally thought: mess around directly with the smgr (or
with some upper layer? I don't know) to create a new relfilenode, and
then attach it to the heap. I don't know if it's possible or too
difficult.
Then, with Bruce's explanation, I thought I should just create a temp
table and exchange relfilenodes, which is much simpler.
Yes, creating the file is easy. Inserting into a file without a heap
relation pointer is a pain. ;-)
Actually, I did a little research on the crash/abort handling of
swapping the relfilenodes between the old and clustered pg_class heap
entries. An abort/crash can happen in several ways: abort of the
transaction, backend crash during the transaction, or cleaning up of old
file at commit.
First, when you create a table, the relfilenode is _copied_ to
pendingDeletes. These files are cleaned up if the transaction aborts in
smgrDoPendingDeletes(). There will also be an entry in there for the
same table after you call heap_delete. As long as the relfilenode at
heap_delete time points to the original heap file, you should be fine.
If it commits, the new file is removed (remember that oid was saved).
If it aborts, the new file is removed (CLUSTER failed).
As far as a crash, there is RemoveTempRelations which calls
FindTempRelations() but I think that only gets populated after the
commit happens and the temp namespace is populated. I assume the
aborted part of the transaction is invisible to the backend at that
point (crash time). Of course, it wouldn't hurt to test these cases to
make sure it works right, and I would document that the copying of
relfilenode in smgr create/unlink is a desired effect relied upon by
cluster and any other commands that may use relfilenode renaming in the
future. (If it wasn't copied, then an abort would delete the _old_
heap. Bad.)
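Conceptually the bookkeeping is just a list of files to unlink at
transaction end, something like this (the field names are my
approximation; see the real declaration in smgr.c rather than this):
typedef struct PendingRelDelete
{
	RelFileNode relnode;		/* file that may need deleting */
	bool		atCommit;		/* true: delete at commit; false: at abort */
	struct PendingRelDelete *next;
} PendingRelDelete;

static PendingRelDelete *pendingDeletes = NULL;		/* head of list */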
FYI, REINDEX also deals with relfilenode renaming in setNewRelfilenode().
The difference is that REINDEX doesn't need to access the old index, just
build a new one, so it can take shortcuts in how it handles things. It
uses two methods to modify the tuple: one directly modifying pg_class,
rather than inserting a new heap row, and another doing it the
traditional way. It does this because REINDEX is used to fix corrupt
indexes, including corrupt system indexes. You will not be using that
type of code in CLUSTER because there is a real temp heap associated
with this operation. Just heap_update() like normal for both relations.
Hope this helps.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote:
FYI, REINDEX also deals with relfilenode renaming in setNewRelfilenode().
The difference is that REINDEX doesn't need to access the old index, just
build a new one, so it can take shortcuts in how it handles things. It
uses two methods to modify the tuple: one directly modifying pg_class,
rather than inserting a new heap row, and another doing it the
traditional way. It does this because REINDEX is used to fix corrupt
indexes, including corrupt system indexes. You will not be using that
type of code in CLUSTER because there is a real temp heap associated
with this operation. Just heap_update() like normal for both relations.
Well, I think my approach is somewhat more naive. What I'm actually
doing is something like:
1. save information on extant indexes
2. create a new heap, and populate (clustered)
3. swap the relfilenodes of the new and old heaps
4. drop new heap (with old relfilenode)
5. for each index saved in 1:
5.1. create a new index with the same attributes
5.2. swap relfilenodes of original and new index
5.3. drop new index (with old relfilenode)
But now I'm lost. It has worked sometimes; then I change a minimal
thing, recompile and then it doesn't work (bufmgr fails an assertion).
I can tell that the new (table) filenode is correct if I skip step 4
above, and check the temp table manually (but of course it has no
indexes). I've never gotten as far as completing all steps right, so I
cannot tell whether the new indexes' filenodes are correct.
I'm posting the new patch. Please review it; I've turned it upside down
a few times and frankly don't know what's happening. I sure am
forgetting a lock or something but cannot find what it is.
--
Alvaro Herrera (<alvherre[a]atentus.com>)
"Aprender sin pensar es inutil; pensar sin aprender, peligroso" (Confucio)
Attachments:
cluster.patch (text/plain; charset=US-ASCII)
Index: cluster.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/commands/cluster.c,v
retrieving revision 1.82
diff -c -u -r1.82 cluster.c
--- cluster.c 2002/06/20 20:29:26 1.82
+++ cluster.c 2002/07/11 02:39:30
@@ -26,47 +26,69 @@
#include "access/heapam.h"
#include "catalog/heap.h"
#include "catalog/index.h"
+#include "catalog/indexing.h"
+#include "catalog/catname.h"
#include "catalog/pg_index.h"
#include "catalog/pg_proc.h"
#include "commands/cluster.h"
#include "commands/tablecmds.h"
#include "miscadmin.h"
#include "utils/builtins.h"
+#include "utils/fmgroids.h"
#include "utils/lsyscache.h"
#include "utils/syscache.h"
+/*
+ * We need one of these structs for each index in the relation to be
+ * clustered. It's basically the data needed by index_create() so
+ * we can recreate the indexes after destroying the old heap.
+ */
+typedef struct
+{
+ char *indexName;
+ IndexInfo *indexInfo;
+ Oid accessMethodOID;
+ Oid *classOID;
+ Oid indexOID;
+ bool isPrimary;
+} IndexAttrs;
static Oid copy_heap(Oid OIDOldHeap, const char *NewName);
-static Oid copy_index(Oid OIDOldIndex, Oid OIDNewHeap,
- const char *NewIndexName);
static void rebuildheap(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex);
+List *get_indexattr_list (Oid OIDOldHeap);
+void recreate_indexattr(Oid OIDOldHeap, List *indexes);
+void swap_relfilenodes(Oid r1, Oid r2);
/*
* cluster
*
* STILL TO DO:
- * Create a list of all the other indexes on this relation. Because
- * the cluster will wreck all the tids, I'll need to destroy bogus
- * indexes. The user will have to re-create them. Not nice, but
- * I'm not a nice guy. The alternative is to try some kind of post
- * destroy re-build. This may be possible. I'll check out what the
- * index create functiond want in the way of paramaters. On the other
- * hand, re-creating n indexes may blow out the space.
+ * Keep foreign keys, permissions and inheritance of the clustered table.
+ *
+ * We need to look at making use of the ability to write a new version of a
+ * table (or index) under a new relfilenode value, without changing the
+ * table's OID.
*/
void
cluster(RangeVar *oldrelation, char *oldindexname)
{
Oid OIDOldHeap,
OIDOldIndex,
- OIDNewHeap,
- OIDNewIndex;
+ OIDNewHeap;
Relation OldHeap,
OldIndex;
char NewHeapName[NAMEDATALEN];
- char NewIndexName[NAMEDATALEN];
+ List *indexes;
+ /* The games with filenodes may not be rollbackable, so
+ * disallow running this inside a transaction block.
+ * This may be a false assumption though.
+ */
+ if (IsTransactionBlock())
+ elog (ERROR, "CLUSTER: may not be called inside a transaction block");
+
/*
- * We grab exclusive access to the target rel and index for the
+ * We grab exclusive access to the target relation for the
* duration of the transaction.
*/
OldHeap = heap_openrv(oldrelation, AccessExclusiveLock);
@@ -94,43 +116,39 @@
heap_close(OldHeap, NoLock);
index_close(OldIndex);
- /*
- * Create the new heap with a temporary name.
- */
+ /* Save the information of all indexes on the relation. */
+ indexes = get_indexattr_list(OIDOldHeap);
+
+ /* Create the new heap with a temporary name. */
snprintf(NewHeapName, NAMEDATALEN, "temp_%u", OIDOldHeap);
OIDNewHeap = copy_heap(OIDOldHeap, NewHeapName);
/* We do not need CommandCounterIncrement() because copy_heap did it. */
- /*
- * Copy the heap data into the new table in the desired order.
- */
+ /* Copy the heap data into the new table in the desired order. */
rebuildheap(OIDNewHeap, OIDOldHeap, OIDOldIndex);
/* To make the new heap's data visible. */
CommandCounterIncrement();
-
- /* Create new index over the tuples of the new heap. */
- snprintf(NewIndexName, NAMEDATALEN, "temp_%u", OIDOldIndex);
-
- OIDNewIndex = copy_index(OIDOldIndex, OIDNewHeap, NewIndexName);
-
- CommandCounterIncrement();
- /* Destroy old heap (along with its index) and rename new. */
- heap_drop_with_catalog(OIDOldHeap, allowSystemTableMods);
+ /* Swap the relfilenodes of the old and new heaps. */
+ swap_relfilenodes(OIDNewHeap, OIDOldHeap);
CommandCounterIncrement();
-
- renamerel(OIDNewHeap, oldrelation->relname);
- /* This one might be unnecessary, but let's be safe. */
+ /* Destroy the new heap, carrying the old filenode along. */
+ heap_drop_with_catalog(OIDNewHeap, allowSystemTableMods);
CommandCounterIncrement();
- renamerel(OIDNewIndex, oldindexname);
+ /* Recreate the indexes on the relation. We do not need
+ * CommandCounterIncrement() because recreate_indexattr does it.
+ */
+ recreate_indexattr(OIDOldHeap, indexes);
}
+/* Create and initialize the new heap
+ */
static Oid
copy_heap(Oid OIDOldHeap, const char *NewName)
{
@@ -173,43 +191,9 @@
return OIDNewHeap;
}
-
-static Oid
-copy_index(Oid OIDOldIndex, Oid OIDNewHeap, const char *NewIndexName)
-{
- Oid OIDNewIndex;
- Relation OldIndex,
- NewHeap;
- IndexInfo *indexInfo;
-
- NewHeap = heap_open(OIDNewHeap, AccessExclusiveLock);
- OldIndex = index_open(OIDOldIndex);
-
- /*
- * Create a new index like the old one. To do this I get the info
- * from pg_index, and add a new index with a temporary name (that will
- * be changed later).
- */
- indexInfo = BuildIndexInfo(OldIndex->rd_index);
-
- OIDNewIndex = index_create(OIDNewHeap,
- NewIndexName,
- indexInfo,
- OldIndex->rd_rel->relam,
- OldIndex->rd_index->indclass,
- OldIndex->rd_index->indisprimary,
- allowSystemTableMods);
-
- setRelhasindex(OIDNewHeap, true,
- OldIndex->rd_index->indisprimary, InvalidOid);
-
- index_close(OldIndex);
- heap_close(NewHeap, NoLock);
-
- return OIDNewIndex;
-}
-
+/* Load the data into the new heap, clustered.
+ */
static void
rebuildheap(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex)
{
@@ -252,4 +236,157 @@
index_close(LocalOldIndex);
heap_close(LocalOldHeap, NoLock);
heap_close(LocalNewHeap, NoLock);
+}
+
+/* Get the necessary info about the indexes in the relation and
+ * return a List of IndexAttrs.
+ */
+List *
+get_indexattr_list (Oid OIDOldHeap)
+{
+ ScanKeyData entry;
+ HeapScanDesc scan;
+ Relation indexRelation;
+ HeapTuple indexTuple;
+ List *indexes = NIL;
+ IndexAttrs *attrs;
+ HeapTuple tuple;
+ Form_pg_index index;
+
+ /* Grab the index tuples by looking into RelationRelationName
+ * by the OID of the old heap.
+ */
+ indexRelation = heap_openr(IndexRelationName, AccessShareLock);
+ ScanKeyEntryInitialize(&entry, 0, Anum_pg_index_indrelid,
+ F_OIDEQ, ObjectIdGetDatum(OIDOldHeap));
+ scan = heap_beginscan(indexRelation, SnapshotNow, 1, &entry);
+ while ((indexTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ index = (Form_pg_index) GETSTRUCT(indexTuple);
+
+ attrs = (IndexAttrs *) palloc(sizeof(IndexAttrs));
+ attrs->indexInfo = BuildIndexInfo(index);
+ attrs->isPrimary = index->indisprimary;
+ attrs->indexOID = index->indexrelid;
+
+ /* The opclasses are copied verbatim from the original indexes.
+ */
+ attrs->classOID = (Oid *)palloc(sizeof(Oid) *
+ attrs->indexInfo->ii_NumIndexAttrs);
+ memcpy(attrs->classOID, index->indclass,
+ sizeof(Oid) * attrs->indexInfo->ii_NumIndexAttrs);
+
+ /* Name and access method of each index come from
+ * RelationRelationName.
+ */
+ tuple = SearchSysCache(RELOID,
+ ObjectIdGetDatum(attrs->indexOID),
+ 0, 0, 0);
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "CLUSTER: cannot find index %u", attrs->indexOID);
+ attrs->indexName = pstrdup(NameStr(((Form_pg_class) GETSTRUCT(tuple))->relname));
+ attrs->accessMethodOID = ((Form_pg_class) GETSTRUCT(tuple))->relam;
+ ReleaseSysCache(tuple);
+
+ /* Cons the gathered data into the list. We do not care about
+ * ordering, and this is more efficient than append.
+ */
+ indexes=lcons((void *)attrs, indexes);
+ }
+ heap_endscan(scan);
+ heap_close(indexRelation, AccessShareLock);
+ return indexes;
+}
+
+/* Create new indexes and swap the filenodes with old indexes. Then drop
+ * the new index (carrying the old heap along).
+ */
+void
+recreate_indexattr(Oid OIDOldHeap, List *indexes)
+{
+ IndexAttrs *attrs;
+ List *elem;
+ Oid newIndexOID;
+ char newIndexName[NAMEDATALEN];
+
+ foreach (elem, indexes)
+ {
+ attrs=(IndexAttrs *) lfirst(elem);
+
+ /* Create the new index under a temporary name */
+ snprintf(newIndexName, NAMEDATALEN, "temp_%u", attrs->indexOID);
+ newIndexOID = index_create(OIDOldHeap, newIndexName,
+ attrs->indexInfo, attrs->accessMethodOID,
+ attrs->classOID, attrs->isPrimary,
+ allowSystemTableMods);
+ CommandCounterIncrement();
+
+ /* Swap the filenodes. */
+ swap_relfilenodes(newIndexOID, attrs->indexOID);
+ setRelhasindex(OIDOldHeap, true, attrs->isPrimary, InvalidOid);
+
+ /* I'm not sure this one is needed, but let's be safe. */
+ CommandCounterIncrement();
+
+ /* Drop the new index, carrying the old filenode along. */
+ index_drop(newIndexOID);
+ CommandCounterIncrement();
+
+ pfree(attrs->classOID);
+ pfree(attrs);
+ }
+ freeList(indexes);
+}
+
+/* Swap the relfilenodes for two given relations.
+ */
+void
+swap_relfilenodes(Oid r1, Oid r2)
+{
+ /* I can probably keep RelationRelationName open in the main
+ * function and pass the Relation around so I don't have to open
+ * it avery time.
+ */
+ Relation relRelation,
+ irels[Num_pg_class_indices];
+ HeapTuple reltup[2];
+ Oid tempRFNode;
+
+ /* We need both RelationRelationName tuples. */
+ relRelation = heap_openr(RelationRelationName, RowExclusiveLock);
+
+ reltup[0] = SearchSysCacheCopy(RELOID,
+ ObjectIdGetDatum(r1),
+ 0, 0, 0);
+ if (!HeapTupleIsValid(reltup[0]))
+ elog(ERROR, "CLUSTER: Cannot find tuple for relation %u", r1);
+ reltup[1] = SearchSysCacheCopy(RELOID,
+ ObjectIdGetDatum(r2),
+ 0, 0, 0);
+ if (!HeapTupleIsValid(reltup[1]))
+ elog(ERROR, "CLUSTER: Cannot find tuple for relation %u", r2);
+
+ /* Actually swap the filenodes */
+
+ tempRFNode = ((Form_pg_class) GETSTRUCT(reltup[0]))->relfilenode;
+ ((Form_pg_class) GETSTRUCT(reltup[0]))->relfilenode =
+ ((Form_pg_class) GETSTRUCT(reltup[1]))->relfilenode;
+ ((Form_pg_class) GETSTRUCT(reltup[1]))->relfilenode = tempRFNode;
+
+ /* Update the RelationRelationName tuples */
+ simple_heap_update(relRelation, &reltup[1]->t_self, reltup[1]);
+ simple_heap_update(relRelation, &reltup[0]->t_self, reltup[0]);
+
+ /* Keep system catalogs current. */
+ CatalogOpenIndices(Num_pg_class_indices, Name_pg_class_indices, irels);
+ CatalogIndexInsert(irels, Num_pg_class_indices, relRelation, reltup[0]);
+ CatalogIndexInsert(irels, Num_pg_class_indices, relRelation, reltup[1]);
+ CatalogCloseIndices(Num_pg_class_indices, irels);
+
+ CommandCounterIncrement();
+
+ heap_close(relRelation, NoLock);
+ heap_freetuple(reltup[0]);
+ heap_freetuple(reltup[1]);
+
}
Tom Lane writes:
Also, is the new relfilenode somehow guaranteed to
not be assigned to another relation (pg_class tuple, I think)?
I've been wondering about that myself. We might have to add a unique
index on pg_class.relfilenode to ensure this; otherwise, after OID
wraparound there would be no guarantees.
I've never been happy with the current setup. It's much too tempting to
think file name = OID, both in the backend code and by external onlookers,
especially since it's currently rare/impossible(?) for them to be
different.
It would be a lot clearer if relfilenode were replaced by an integer
version, starting at 0, and the heap files were named "OID_VERSION".
(In related news, how about filling up the oid/relfilenode numbers with
zeros on the left, so a directory listing would reflect the numerical
order?)
--
Peter Eisentraut peter_e@gmx.net
Peter Eisentraut wrote:
Tom Lane writes:
Also, is the new relfilenode somehow guaranteed to
not be assigned to another relation (pg_class tuple, I think)?
I've been wondering about that myself. We might have to add a unique
index on pg_class.relfilenode to ensure this; otherwise, after OID
wraparound there would be no guarantees.
I've never been happy with the current setup. It's much too tempting to
think file name = OID, both in the backend code and by external onlookers,
especially since it's currently rare/impossible(?) for them to be
different.
Yea, only reindex and cluster change them. Problem is we already have
oid as a nice unique number ready for use. I don't see a huge advantage
of improving it.
It would be a lot clearer if relfilenode were replaced by an integer
version, starting at 0, and the heap files were named "OID_VERSION".
Problem there is that we can't have relfilenode as an int unless we take
the table oid and sequence number and merge them on the fly in the
backend. Would be nice for admins, though, so the oid would be there.
I thought WAL liked the relfilenode as a single number.
(In related news, how about filling up the oid/relfilenode numbers with
zeros on the left, so a directory listing would reflect the numerical
order?)
Problem there is that we increase the size of much of the directory
lookups. Not sure if it is worth worrying about.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Peter Eisentraut <peter_e@gmx.net> writes:
It would be a lot clearer if relfilenode were replaced by an integer
version, starting at 0, and the heap files were named "OID_VERSION".
The reason to not do that is that the bufmgr and levels below would now
need to pass around three numbers, not two, to identify physical files.
We already beat this topic to death a year ago, I'm not eager to re-open
it.
regards, tom lane
On Mon, 15 Jul 2002, Bruce Momjian wrote:
(In related news, how about filling up the oid/relfilenode numbers with
zeros on the left, so a directory listing would reflect the numerical
order?)
Problem there is that we increase the size of much of the directory
lookups. Not sure if it is worth worrying about.
Probably not such a big deal, since most systems will be reading
a full block (8K or 16K under *BSD) to load the directory information
anyway. Doing it in hex would give you only 8-char filenames, anyway.
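Just to illustrate the width point (a hypothetical snippet, not a
proposal for the actual file-naming code):
Oid			relfilenode = 16384;	/* example value */
char		fname[16];

/* a 32-bit value in zero-padded hex is always eight characters */
snprintf(fname, sizeof(fname), "%08x", (unsigned int) relfilenode);
/* fname is now "00004000"; the largest OID, 4294967295, gives "ffffffff" */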
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
Curt Sampson wrote:
On Mon, 15 Jul 2002, Bruce Momjian wrote:
(In related news, how about filling up the oid/relfilenode numbers with
zeros on the left, so a directory listing would reflect the numerical
order?)
Problem there is that we increase the size of much of the directory
lookups. Not sure if it is worth worrying about.
Probably not such a big deal, since most systems will be reading
a full block (8K or 16K under *BSD) to load the directory information
anyway. Doing it in hex would give you only 8-char filenames, anyway.
Yes, hex may be interesting as a more compact, consistent format. We
need to change the docs so oid2name and queries convert to hex on
output.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
(In related news, how about filling up the oid/relfilenode numbers with
zeros on the left, so a directory listing would reflect the numerical
order?)
Yes, hex may be interesting as a more compact, consistent format. We
need to change the docs so oid2name and queries convert to hex on
output.
I don't really see the value-added here. If we had made this decision
before releasing 7.1, I'd not have objected; but at this point we're
talking about breaking oid2name and any similar scripts that people
may have developed, for what's really a *very* marginal gain. Who cares
whether a directory listing reflects numerical order?
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
(In related news, how about filling up the oid/relfilenode numbers with
zeros on the left, so a directory listing would reflect the numerical
order?)
Yes, hex may be interesting as a more compact, consistent format. We
need to change the docs so oid2name and queries convert to hex on
output.
I don't really see the value-added here. If we had made this decision
before releasing 7.1, I'd not have objected; but at this point we're
talking about breaking oid2name and any similar scripts that people
may have developed, for what's really a *very* marginal gain. Who cares
whether a directory listing reflects numerical order?
I don't see the big value either, just brainstorming.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026