WIP: pre-upgrade page reservation

Started by Zdenek Kotala about 17 years ago, 5 messages
#1Zdenek Kotala
Zdenek.Kotala@Sun.COM
1 attachment(s)

I have attached a contrib module that is the basis of the pre-upgrade script. It
should be part of 8.4, but it will be required for the 8.4->8.5 upgrade.

This part contains space reservation for heap/toast relations. The idea is that
the relation is read and each block is checked for enough free space. Tuples
that will not be visible after the upgrade are not counted. If there is not
enough space, simple_heap_update is called on tuple(s) until enough space has
been released.
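
The per-block check described above can be sketched as standalone C. This is only a simplified model, not the patch itself: ModelTuple, bytes_to_free and tuples_to_move are hypothetical names, and the sizes are plain numbers rather than values read from a real page.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one visible tuple (not a PostgreSQL type). */
typedef struct
{
	size_t		len;			/* aligned on-page length in bytes */
} ModelTuple;

/*
 * Bytes that must be released so that occupied + reserved fits into
 * page_max. Returns 0 when the page already has enough room.
 */
size_t
bytes_to_free(size_t occupied, size_t ntuples,
			  size_t reserve_per_page, size_t reserve_per_tuple,
			  size_t page_max)
{
	size_t		reserved = reserve_per_page + ntuples * reserve_per_tuple;

	if (occupied + reserved <= page_max)
		return 0;
	return occupied + reserved - page_max;
}

/*
 * Count how many tuples, taken in page order, have to be moved off the
 * page to release at least `needed` bytes. Like the attached patch, this
 * model ignores the per-tuple reservation that moving a tuple also frees.
 */
size_t
tuples_to_move(const ModelTuple *tuples, size_t ntuples, size_t needed)
{
	size_t		moved = 0;
	size_t		freed = 0;

	assert(tuples != NULL || ntuples == 0);
	while (moved < ntuples && freed < needed)
	{
		freed += tuples[moved].len;
		moved++;
	}
	return moved;
}
```

In the patch, the "move" step corresponds to calling simple_heap_update on each selected tuple; the model only counts how many tuples that would be.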

BTree space reservation is more complicated. I plan to use _bt_split to split
the page into two half pages with the following code:

firstright = _bt_findsplitloc(rel, page, InvalidOffsetNumber, 0,&newitemonleft);
_bt_split(rel, buffer, firstright, InvalidOffsetNumber, 0, NULL,newitemonleft);
_bt_insert_parent(rel, buffer, rbuffer, stack, is_root, is_only);

Because both functions (_bt_findsplitloc, _bt_split) expect that we want to
insert a new item, they will require modification to accept InvalidOffsetNumber.
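
To illustrate what choosing a split point without an incoming item could look like, here is a minimal standalone sketch. It is an assumption-laden model, not _bt_findsplitloc: the hypothetical find_split_no_newitem simply balances the bytes on both halves and ignores fillfactor, the high key and the other considerations of the real function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the index of the first item that would go to the right page,
 * choosing the split that minimizes |left_bytes - right_bytes|.
 * Needs at least two items so that both halves are non-empty.
 */
size_t
find_split_no_newitem(const size_t *item_len, size_t nitems)
{
	size_t		total = 0;
	size_t		left = 0;
	size_t		best = 1;
	size_t		best_diff;
	size_t		i;

	assert(item_len != NULL && nitems >= 2);

	for (i = 0; i < nitems; i++)
		total += item_len[i];

	best_diff = total;			/* no real split can be worse than this */
	for (i = 1; i < nitems; i++)
	{
		size_t		right;
		size_t		diff;

		left += item_len[i - 1];	/* items 0 .. i-1 stay on the left */
		right = total - left;
		diff = (left > right) ? left - right : right - left;
		if (diff < best_diff)
		{
			best_diff = diff;
			best = i;
		}
	}
	return best;
}
```

For four items of equal size the sketch puts two items on each side; when one item dominates the page, the split moves next to that item.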

Another problem is building the stack, which requires a deep tree scan. I hope
that it will not require an exclusive lock on the index.

I have not yet looked at hash, GiST and GIN. I think the hash index should be
easy, because index tuples can be moved into a new bucket page. (Note: the
general problem with hash indexes is still the bitmap pages.)

I guess the solution for a GiST index should be similar to BTree, but I don't
have any idea about GIN.

Comments, ideas, better solutions?

thanks Zdenek

PS: This patch requires the previous patch, which implemented the space
reservation functionality.

Attachments:

preupgrade.patch (text/x-diff; name=preupgrade.patch)
diff -r 84e2e9c42ef7 contrib/pg_upgrade/Makefile
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/contrib/pg_upgrade/Makefile	Fri Dec 12 11:32:03 2008 +0100
@@ -0,0 +1,24 @@
+#-------------------------------------------------------------------------
+#
+# pg_upgrade Makefile
+#
+# $PostgreSQL:  $
+#
+#-------------------------------------------------------------------------
+
+MODULE_big	= pg_upgrade
+OBJS		= pg_upgrade.o rs_heap.o rs_nbtree.o
+DATA_built	= pg_upgrade.sql
+#DATA      	= uninstall_pgstattuple.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_upgrade
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
diff -r 84e2e9c42ef7 contrib/pg_upgrade/pg_upgrade.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/contrib/pg_upgrade/pg_upgrade.c	Fri Dec 12 11:32:03 2008 +0100
@@ -0,0 +1,133 @@
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "funcapi.h"
+#include "access/clog.h"
+#include "access/heapam.h"
+#include "access/htup.h"
+#include "catalog/namespace.h"
+#include "storage/bufmgr.h"
+#include "storage/bufpage.h"
+#include "storage/relfilenode.h"
+#include "utils/builtins.h"
+#include "utils/rel.h"
+
+#include "miscadmin.h"
+
+
+PG_MODULE_MAGIC;
+
+
+PG_FUNCTION_INFO_V1(rscheck_rel_by_oid);
+PG_FUNCTION_INFO_V1(rscheck_rel_by_name);
+
+/*
+ * Return the list of visible items, their count and total size. Dead and
+ * deleted tuples do not need to be counted.
+ * The list is terminated by a zeroed entry.
+ */
+HeapTuple
+page_get_visible_items(Page page, BlockNumber block, int *items, Size *size, bool isHeap)
+{
+	OffsetNumber 	i;
+	OffsetNumber 	maxoff;
+	ItemId			iid;
+	HeapTupleHeader tph;
+	HeapTuple		vi_list = NULL;
+	int				count = 0;
+	Size			occupied = 0;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if(isHeap)
+		vi_list = palloc0(sizeof(HeapTupleData)*(maxoff+1));
+		
+	for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+	{
+		iid = PageGetItemId(page, i);
+		if(ItemIdIsNormal(iid))
+		{
+			if(isHeap)
+			{
+				XLogRecPtr lsn;
+				tph = (HeapTupleHeader) PageGetItem(page, iid);
+
+				/* Count only really visible tuples; the others should be removed during page conversion. */
+				if( TRANSACTION_STATUS_COMMITTED != TransactionIdGetStatus(HeapTupleHeaderGetXmin(tph), &lsn) ||
+					TRANSACTION_STATUS_COMMITTED == TransactionIdGetStatus(HeapTupleHeaderGetXmax(tph), &lsn) )
+					continue;
+
+				vi_list[count].t_data = (HeapTupleHeader) PageGetItem(page, iid);
+				vi_list[count].t_len = ItemIdGetLength(iid);
+				ItemPointerSet(&(vi_list[count].t_self), block, i);
+			}	
+			occupied += MAXALIGN(ItemIdGetLength(iid));
+			count++;
+		}
+	}
+
+	if(items != NULL)
+		*items = count;
+	if(size != NULL)
+		*size = occupied;
+
+	return vi_list;
+}
+
+int
+rscheck_rel(Relation rel)
+{
+	uint32		blkcnt;
+	uint32 		modified = 0;          
+
+	blkcnt = RelationGetNumberOfBlocks(rel);
+//	elog(NOTICE,"Start reserved space check - relation %s",rel->rd_rel->relname);
+	elog(NOTICE,"Total blocks for processing: %u", blkcnt);
+
+	switch(rel->rd_rel->relkind)
+	{
+		case 'r' :
+		case 't' : rs_heap(rel);
+				   break;
+		default : elog(ERROR, "Cannot reserve space. Unsupported relation kind.");
+	}
+	elog(NOTICE,"Total modified blocks: %u", modified);
+
+	return 0;
+}
+
+Datum
+rscheck_rel_by_name(PG_FUNCTION_ARGS)
+{
+	text       *relname = PG_GETARG_TEXT_P(0);
+	RangeVar   *relrv;
+	Relation    rel;
+
+	if (!superuser())
+		ereport(ERROR,
+		(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+		(errmsg("must be superuser to use rs_check functions"))));
+
+	/* open relation */
+	relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
+	rel = relation_openrv(relrv, AccessExclusiveLock);
+
+	PG_RETURN_INT32(rscheck_rel(rel));
+}
+
+Datum
+rscheck_rel_by_oid(PG_FUNCTION_ARGS)
+{
+	Relation    rel;
+	Oid			relID;
+
+	if (!superuser())
+		ereport(ERROR,
+		(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+		(errmsg("must be superuser to use rs_check functions"))));
+
+	relID = PG_GETARG_OID(0);
+	rel = relation_open(relID,  AccessExclusiveLock);
+	PG_RETURN_INT32(rscheck_rel(rel));
+}
+
diff -r 84e2e9c42ef7 contrib/pg_upgrade/pg_upgrade.sql.in
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/contrib/pg_upgrade/pg_upgrade.sql.in	Fri Dec 12 11:32:03 2008 +0100
@@ -0,0 +1,8 @@
+CREATE OR REPLACE FUNCTION rs_check(IN relid oid)
+RETURNS int
+AS '$libdir/pg_upgrade','rscheck_rel_by_oid' LANGUAGE C STRICT;
+
+CREATE OR REPLACE FUNCTION rs_check(IN relname text)
+RETURNS int
+AS '$libdir/pg_upgrade','rscheck_rel_by_name' LANGUAGE C STRICT;
+
diff -r 84e2e9c42ef7 contrib/pg_upgrade/rs_check.h
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/contrib/pg_upgrade/rs_check.h	Fri Dec 12 11:32:03 2008 +0100
@@ -0,0 +1,8 @@
+#include "access/htup.h"
+#include "storage/bufpage.h"
+#include "utils/rel.h"
+
+extern HeapTuple page_get_visible_items(Page page, BlockNumber block, int *items, Size *size, bool isHeap);
+extern int rs_heap(Relation rel);
+extern int rs_nbtree(Relation rel);
+
diff -r 84e2e9c42ef7 contrib/pg_upgrade/rs_heap.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/contrib/pg_upgrade/rs_heap.c	Fri Dec 12 11:32:03 2008 +0100
@@ -0,0 +1,79 @@
+#include "postgres.h"
+
+#include "access/clog.h"
+#include "access/heapam.h"
+#include "catalog/namespace.h"
+#include "storage/bufmgr.h"
+#include "storage/relfilenode.h"
+#include "rs_check.h"
+
+void
+rs_heap_move_tuple(Relation rel, HeapTuple vi_list, int requested_size)
+{
+	HeapTuple tuple;
+	int n;
+
+	for(n = 0; requested_size > 0 && vi_list[n].t_data != NULL; n++)
+	{
+		tuple = heap_copytuple(&vi_list[n]);
+		simple_heap_update(rel, &tuple->t_self, tuple);
+		pfree(tuple);
+		requested_size -= MAXALIGN(vi_list[n].t_len);
+	}
+}
+
+int
+rs_heap(Relation rel)
+{
+	BlockNumber	blkno;
+	uint32		blkcnt, upgraded = 0;          
+	Buffer		buffer;
+	int			items;
+	Size		space_occupied;
+	Size		space_max;
+	Size		space_reserved;
+
+	blkcnt = RelationGetNumberOfBlocks(rel);
+
+	Assert(rel->rd_rel->relkind == 'r' || rel->rd_rel->relkind == 't');
+
+	for(blkno = 0; blkno < blkcnt; blkno++)
+	{
+		HeapTuple vi;
+
+		if( (blkno+1) % 1000 == 0 )
+			elog(NOTICE,"%u blocks have been processed.", blkno+1);
+
+		buffer = ReadBuffer(rel, blkno);
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+		vi = page_get_visible_items(BufferGetPage(buffer),
+									blkno,
+									&items, &space_occupied,
+									true); 
+
+		space_max = PageGetMaxDataSpace(BufferGetPage(buffer));
+
+		space_reserved = RelationGetReservedSpacePerPage(rel)+
+						 items*RelationGetReservedSpacePerTuple(rel);
+
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+		if(space_occupied+space_reserved > space_max)
+		{
+			elog(NOTICE,"Block %u needs cleanup. %lu+%lu>%lu", blkno,
+					(unsigned long) space_occupied, (unsigned long) space_reserved, (unsigned long) space_max);
+
+			upgraded++;
+			rs_heap_move_tuple(rel, vi, space_occupied+space_reserved-space_max);
+		}
+
+		if(vi != NULL) pfree(vi);
+		ReleaseBuffer(buffer);
+	} 
+	relation_close(rel,AccessExclusiveLock);
+	elog(NOTICE,"Total upgraded blocks: %u", upgraded);
+
+	return 0;
+}
+
diff -r 84e2e9c42ef7 contrib/pg_upgrade/rs_nbtree.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/contrib/pg_upgrade/rs_nbtree.c	Fri Dec 12 11:32:03 2008 +0100
@@ -0,0 +1,58 @@
+#include "postgres.h"
+
+#include "rs_check.h"
+
+int rs_nbtree(Relation rel)
+{
+	elog(ERROR,"Space reservation is not supported for BTree.");
+	return 0;
+}
+
+/*
+ * _bt_reservespace -- reserve space on page for future in-place upgrade
+ */
+/*
+void
+_bt_reservespace(Relation rel, Buffer buffer)
+{
+	BTPageOpaque lpageop;
+	bool		is_root;
+	bool		is_only;
+	bool		newitemonleft;
+	Buffer		rbuffer;
+	Page		page;
+ 	BTStack     stack;
+
+
+	page = BufferGetPage(buffer);
+	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+	is_root = P_ISROOT(lpageop);
+	is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(lpageop);
+
+*/	/* Choose the split point */
+//	firstright = _bt_findsplitloc(rel, page,
+//								  InvalidOffsetNumber, 0,
+//								  &newitemonleft);
+
+	/* split the buffer into left and right halves */
+	//rbuffer = _bt_split(rel, buffer, firstright,
+		//			 InvalidOffsetNumber, 0, NULL, newitemonleft);
+
+	/*----------
+	 * By here,
+	 *
+	 *		+  our target page has been split;
+	 *		+  the original tuple has been inserted;
+	 *		+  we have write locks on both the old (left half)
+	 *		   and new (right half) buffers, after the split; and
+	 *		+  we know the key we want to insert into the parent
+	 *		   (it's the "high key" on the left child page).
+	 *
+	 * We're ready to do the parent insertion.  We need to hold onto the
+	 * locks for the child pages until we locate the parent, but we can
+	 * release them before doing the actual insertion (see Lehman and Yao
+	 * for the reasoning).
+	 *----------
+	 */
+//	_bt_insert_parent(rel, buffer, rbuffer, stack, is_root, is_only);
+//}
#2Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Zdenek Kotala (#1)
Re: WIP: pre-upgrade page reservation

Zdenek Kotala wrote:

BTree space reservation is more complicated.

Do you need to pre-reserve the space for b-tree? I think you can just
split it at upgrade, in the new version. The problem with doing that for
heaps is that to move a heap tuple you need to update the index
pointers, but for indexes there's no such restriction.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#3Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Heikki Linnakangas (#2)
Re: WIP: pre-upgrade page reservation

Heikki Linnakangas wrote:

Zdenek Kotala wrote:

BTree space reservation is more complicated.

Do you need to pre-reserve the space for b-tree? I think you can just
split it at upgrade, in the new version. The problem with doing that for
heaps is that to move a heap tuple you need to update the index
pointers, but for indexes there's no such restriction.

The problem is that I need to know the parent and modify the parent as well.
But you don't know which node is your parent. You need to know the root and
descend from the root. That is why I think that it is not doable online.

Correct me if I'm wrong.

thanks Zdenek

#4Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Zdenek Kotala (#3)
Re: WIP: pre-upgrade page reservation

Zdenek Kotala wrote:

Heikki Linnakangas wrote:

Zdenek Kotala wrote:

BTree space reservation is more complicated.

Do you need to pre-reserve the space for b-tree? I think you can just
split it at upgrade, in the new version. The problem with doing that
for heaps is that to move a heap tuple you need to update the index
pointers, but for indexes there's no such restriction.

The problem is that I need to know the parent and modify the parent as well.
But you don't know which node is your parent. You need to know the root and
descend from the root. That is why I think that it is not doable online.

Oh, you're planning to walk the B-tree in index order, not physical
order, so that you always have the stack for inserting the parents? You
don't necessarily need the stack, if you're not worried about
performance. _bt_insert_parent will scan the next level up to find the
parent in that case. That's slow, but so is walking the B-tree, and I'd
expect it to be rare that you need to split b-tree pages at upgrade anyway.
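
The stack-less parent lookup described above can be pictured with a small standalone model. This is a sketch under assumptions, not the nbtree code: ModelInternalPage and find_parent_by_scan are hypothetical, pages are array slots, and the scan simply starts at the leftmost page of the parent level and follows right links until it finds a downlink to the child.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_CHILDREN	4
#define MODEL_NO_PAGE		(-1)

/* Hypothetical internal page: a few downlinks plus a right-sibling link. */
typedef struct
{
	int			children[MODEL_MAX_CHILDREN];	/* MODEL_NO_PAGE if unused */
	int			right;							/* MODEL_NO_PAGE on the rightmost page */
} ModelInternalPage;

/*
 * Walk one level from its leftmost page to the right and return the page
 * that holds a downlink to `child`, or MODEL_NO_PAGE if none does.
 */
int
find_parent_by_scan(const ModelInternalPage *pages, int leftmost, int child)
{
	int			cur;

	assert(pages != NULL);
	for (cur = leftmost; cur != MODEL_NO_PAGE; cur = pages[cur].right)
	{
		int			i;

		for (i = 0; i < MODEL_MAX_CHILDREN; i++)
		{
			if (pages[cur].children[i] == child)
				return cur;
		}
	}
	return MODEL_NO_PAGE;
}
```

The cost is linear in the number of pages on the parent level, which is why this path is slow compared with keeping a stack from a root-to-leaf descent.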

(I still think you're distracted, BTW. There's zero evidence that we'll
need any of this for the 8.4->8.5 upgrade. And if we do, we don't know
for sure that this will solve the problem, whatever the problem is.)

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#5Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Heikki Linnakangas (#4)
Re: WIP: pre-upgrade page reservation

Heikki Linnakangas wrote:

Zdenek Kotala wrote:

Heikki Linnakangas wrote:

Zdenek Kotala wrote:

BTree space reservation is more complicated.

Do you need to pre-reserve the space for b-tree? I think you can just
split it at upgrade, in the new version. The problem with doing that
for heaps is that to move a heap tuple you need to update the index
pointers, but for indexes there's no such restriction.

The problem is that I need to know the parent and modify the parent as well.
But you don't know which node is your parent. You need to know the root and
descend from the root. That is why I think that it is not doable online.

Oh, you're planning to walk the B-tree in index order, not physical
order, so that you always have the stack for inserting the parents?

Yes, that was the idea.

You
don't necessarily need the stack, if you're not worried about
performance. _bt_insert_parent will scan the next level up to find the
parent in that case. That's slow, but so is walking the B-tree, and I'd
expect it to be rare that you need to split b-tree pages at upgrade anyway.

Cool. I overlooked it.

(I still think you're distracted, BTW. There's zero evidence that we'll
need any of this for the 8.4->8.5 upgrade. And if we do, we don't know
for sure that this will solve the problem, whatever the problem is.)

We decided in a previous thread that we need space reservation if we want to
have a CRC field in the page header, to prevent the space expansion problem in
page conversion during upgrade. I think that we currently know what is
necessary for the 8.2->8.3/4 upgrade and what problems we can expect. We don't
know whether these kinds of changes will happen in the future or not. We know
only about CRC at this moment. But I would like to prepare PostgreSQL to deal
with all these issues. I'm not very happy with the idea of backporting a lot of
code to an older version.

IIRC, we don't plan to backport space reservation into 8.2, because ...
Why should it be accepted for 8.4 when 8.5 is released?

Maybe I am missing something, or maybe I have gotten lost in the mailing thread and opinions ...

Zdenek