PG_PAGE_LAYOUT_VERSION 5 - time for change
It seems that we are going to bump the page layout version to version 5 (see the CRC
patch for details). Maybe it is a good time to do some other changes. Here is a
list of ideas (please, do not beat me :-). Some of them we discussed in Prato,
and Greg may have more.
1) HeapTupleHeader modification
typedef struct HeapTupleFields
{
    TransactionId t_xmin;       /* inserting xact ID */
    TransactionId t_xmax;       /* deleting or locking xact ID */
    union
    {
        CommandId     t_cid;
        TransactionId t_xvac;   /* VACUUM FULL xact ID */
    } t_field3;
    uint16 t_infomask;
} HeapTupleFields;

typedef struct HeapTupleHeaderData
{
    union
    {
        HeapTupleFields  t_heap;
        DatumTupleFields t_datum;
    } t_choice;

    ItemPointerData t_ctid;     /* current TID of this or newer tuple */

    /* Fields below here must match MinimalTupleData! */
    uint16 t_infomask2;
    uint8  t_hoff;
    /* ^ - 23 bytes - ^ */
    bits8  t_bits[1];
} HeapTupleHeaderData;
This also requires shuffling some flags between infomask and infomask2: infomask2
would carry only the flags HASNULL, HASOID, HASVARWIDTH and HASEXTERNAL, and the
minimal tuple would then not need the infomask field at all, since it would contain
only transaction hint bits. Unfortunately, structure alignment is not very friendly here.
2) Add a page type (e.g. btree) and subtype (e.g. metapage) flag into the page header.
I think it will be useful when we use shared buffers for clog.
3) TOAST modification
a) TOAST table per attribute
b) replace chunk id with offset+variable chunk size
c) add column identification into first chunk
That's all. I think the infomask/infomask2 flag shuffle should be done. The TOAST
modification complicates in-place upgrade.
Comments, other ideas?
Zdenek
--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
3) TOAST modification
a) TOAST table per attribute
b) replace chunk id with offset+variable chunk size
c) add column identification into first chunk
That's all. I think the infomask/infomask2 flag shuffle should be done. The TOAST
modification complicates in-place upgrade.
I don't think a TOAST table per attribute is feasible. You would end up with
thousands of toast tables. It might be interesting as an option if you plan to
drop the column, but I don't see it as terribly interesting.
What seemed to make sense to me for solving your problem was including the
type oid in the toast chunks. I suppose attribute number might be just as good
-- it would let you save upgrading chunks for dropped columns at the expense
of having to look up the column info first.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
1) HeapTupleHeader modification
typedef struct HeapTupleFields
{
    TransactionId t_xmin;       /* inserting xact ID */
    TransactionId t_xmax;       /* deleting or locking xact ID */
    union
    {
        CommandId     t_cid;
        TransactionId t_xvac;   /* VACUUM FULL xact ID */
    } t_field3;
    uint16 t_infomask;
} HeapTupleFields;
This is unworkable (hint: the compiler will decide sizeof the struct
must be a multiple of 4). I am also frightened to death of the proposal
to swap various bits around between infomask and infomask2 --- that is
*guaranteed* to break code silently. And you didn't explain exactly
what it buys, anyway. Not space savings in the Datum form; alignment
issues will prevent that.
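To make the padding concrete, here is a stand-alone illustration (not PostgreSQL
code; the stdint types stand in for TransactionId and CommandId):

#include <stdio.h>
#include <stdint.h>

/* Mirrors the proposed HeapTupleFields layout with plain C types. */
typedef struct ProposedFields
{
    uint32_t t_xmin;
    uint32_t t_xmax;
    union
    {
        uint32_t t_cid;
        uint32_t t_xvac;
    } t_field3;
    uint16_t t_infomask;        /* 2 bytes of data, then 2 bytes of padding */
} ProposedFields;

int
main(void)
{
    /* Prints 16 on typical ABIs, not 14: the struct's alignment is 4
     * because of its 4-byte members, so its size is rounded up. */
    printf("sizeof(ProposedFields) = %zu\n", sizeof(ProposedFields));
    return 0;
}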
2) Add a page type (e.g. btree) and subtype (e.g. metapage) flag into the page header.
I think it will be useful when we use shared buffers for clog.
I think this is a pretty bad idea, because it'll eat space on every page
for data that is only useful to indexes. I don't believe that clog will
find it interesting either. To share buffers with clog will require
a change in buffer lookup tags, not buffer contents.
3) TOAST modification
a) TOAST table per attribute
b) replace chunk id with offset+variable chunk size
c) add column identification into first chunk
I don't like 3a any more than Greg does. 3b sounds good until you
reflect that a genuinely variable chunk size would preclude random
access to sub-ranges of a toast value. A column ID might be worth
adding for robustness purposes, though reducing the toast chunk payload
size to make that possible will cause you fits for in-place upgrade.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
2) Add a page type (e.g. btree) and subtype (e.g. metapage) flag into the page header.
I think it will be useful when we use shared buffers for clog.
I think this is a pretty bad idea, because it'll eat space on every page
for data that is only useful to indexes. I don't believe that clog will
find it interesting either. To share buffers with clog will require
a change in buffer lookup tags, not buffer contents.
Another example application that came to mind: if we ever wanted to do
something like retail vacuum, pruning, or hint bit setting from the bgwriter, it
would have to know how to tell heap pages apart from index pages. I'm not sure
whether that would have to be on the page or whether it could be in the buffer tag
as well?
If we do decide we want to do this, it wouldn't have to take very much space. 16
page types with 16 subtypes each would be plenty, and that would fit in a single
byte.
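Purely as an illustration of the packing (the names here are invented, not
proposed header fields or macros):

#include <stdint.h>

/* One spare byte: high nibble = page type, low nibble = subtype. */
#define PD_TYPE_HEAP        0x1
#define PD_TYPE_BTREE       0x2
#define PD_SUBTYPE_NONE     0x0
#define PD_SUBTYPE_META     0x1

static inline uint8_t
pd_make_typebyte(uint8_t type, uint8_t subtype)
{
    return (uint8_t) ((type << 4) | (subtype & 0x0F));
}

static inline uint8_t
pd_get_type(uint8_t typebyte)
{
    return (uint8_t) (typebyte >> 4);
}

static inline uint8_t
pd_get_subtype(uint8_t typebyte)
{
    return (uint8_t) (typebyte & 0x0F);
}

/* e.g. pd_make_typebyte(PD_TYPE_BTREE, PD_SUBTYPE_META) == 0x21 */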
3) TOAST modification
a) TOAST table per attribute
b) replace chunk id with offset+variable chunk size
c) add column identification into first chunk
I don't like 3a any more than Greg does. 3b sounds good until you
reflect that a genuinely variable chunk size would preclude random
access to sub-ranges of a toast value.
Hm, Heikki had me convinced it wouldn't but now that I try to explain it I
can't get it to work. I think the idea is you start a scan at the desired
offset and scan until you reach a chunk which overruns the end of the desired
piece. However you really need to start scanning at the last chunk *prior* to
the desired offset.
I think you can actually do this with btrees, but I don't know if our APIs
support it. If you scan to find the first chunk > the desired offset and then
scan backwards one tuple, you should be looking at the chunk in which the
desired offset lies.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
Gregory Stark <stark@enterprisedb.com> writes:
Tom Lane <tgl@sss.pgh.pa.us> writes:
... 3b sounds good until you
reflect that a genuinely variable chunk size would preclude random
access to sub-ranges of a toast value.
Hm, Heikki had me convinced it wouldn't but now that I try to explain it I
can't get it to work. I think the idea is you start a scan at the desired
offset and scan until you reach a chunk which overruns the end of the desired
piece. However you really need to start scanning at the last chunk *prior* to
the desired offset.
Yeah, that was my conclusion too.
I think you can actually do this with btrees, but I don't know if our APIs
support it. If you scan to find the first chunk > the desired offset and then
scan backwards one tuple, you should be looking at the chunk in which the
desired offset lies.
Well, that might work but it would typically cost you an extra fetch.
Do we really have a use case for variable chunk size that is worth the
cost?
regards, tom lane
Tom Lane wrote:
Gregory Stark <stark@enterprisedb.com> writes:
Tom Lane <tgl@sss.pgh.pa.us> writes:
... 3b sounds good until you
reflect that a genuinely variable chunk size would preclude random
access to sub-ranges of a toast value.
Hm, Heikki had me convinced it wouldn't but now that I try to explain it I
can't get it to work. I think the idea is you start a scan at the desired
offset and scan until you reach a chunk which overruns the end of the desired
piece. However you really need to start scanning at the last chunk *prior* to
the desired offset.
Yeah, that was my conclusion too.
Hmm, you're right. I think it can be made to work by storing the *end*
offset of each chunk. To find the chunk containing offset X, search for
the first chunk with end_offset > X.
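For illustration only (hypothetical code, not from any patch), the lookup is
just a lower-bound search over the stored end offsets - which is what a btree
scan on (valueid, chunk_end) with the qual chunk_end > X gives you:

#include <stddef.h>
#include <stdint.h>

/*
 * Given chunk end offsets in ascending order, return the index of the chunk
 * containing byte x, i.e. the first chunk whose end offset is greater than x.
 * Returns nchunks if x lies past the end of the value.
 */
static size_t
chunk_containing(const int32_t *end_offsets, size_t nchunks, int32_t x)
{
    size_t lo = 0;
    size_t hi = nchunks;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (end_offsets[mid] > x)
            hi = mid;           /* ends past x: candidate, look left */
        else
            lo = mid + 1;       /* ends at or before x: look right */
    }
    return lo;
}

/*
 * Example: chunks covering [0,2000), [2000,4000) and [4000,5500) store end
 * offsets 2000, 4000 and 5500; chunk_containing(ends, 3, 3500) returns 1,
 * the chunk that actually holds byte 3500.
 */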
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Gregory Stark napsal(a):
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
3) TOAST modification
a) TOAST table per attribute
b) replace chunk id with offset+variable chunk size
c) add column identification into first chunk
That's all. I think the infomask/infomask2 flag shuffle should be done. The TOAST
modification complicates in-place upgrade.
I don't think a TOAST table per attribute is feasible. You would end up with
thousands of toast tables. It might be interesting as an option if you plan to
drop the column, but I don't see it as terribly interesting.
Yeah, I could not remember what the problem with this was.
What seemed to make sense to me for solving your problem was including the
type oid in the toast chunks. I suppose attribute number might be just as good
-- it would let you save upgrading chunks for dropped columns at the expense
of having to look up the column info first.
It does not solve my problem now, because I need to solve it for old versions of
PostgreSQL as well. But it should help in the future, and vacuum could also easily
clean chunks related to dropped columns.
Zdenek
--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql
Tom Lane napsal(a):
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
1) HeapTupleHeader modification
typedef struct HeapTupleFields
{
    TransactionId t_xmin;       /* inserting xact ID */
    TransactionId t_xmax;       /* deleting or locking xact ID */
    union
    {
        CommandId     t_cid;
        TransactionId t_xvac;   /* VACUUM FULL xact ID */
    } t_field3;
    uint16 t_infomask;
} HeapTupleFields;
This is unworkable (hint: the compiler will decide sizeof the struct
must be a multiple of 4). I am also frightened to death of the proposal
to swap various bits around between infomask and infomask2 --- that is
*guaranteed* to break code silently.
Uh? If the flag shuffle breaks code, that code is not good for in-place upgrade anyway.
Do you mean something specific? I have already turned all access to the flags into
functions.
And you didn't explain exactly what it buys, anyway. Not space savings
in the Datum form; alignment issues will prevent that.
OK. The idea is to consolidate the structures - to have one basic structure for the data:
typedef struct DataHeaderData
{
    uint16 t_infomask2;
    uint8  t_hoff;
    bits8  t_bits[1];
} DataHeaderData;
which corresponds to the minimal tuple and is also useful for index tuples.
If I understand correctly, the other (transaction) information is not useful in
the executor (except when it is explicitly mentioned in a SELECT).
I'm not sure, but I think we can store composite types without typid and typmod,
and it would save some bytes. After that we could have structures such as
VisibilityTupleHeader, DatumTupleHeader and IndexTupleHeader, and data on disk would
be stored as:
VisibilityTupleHeaderData|DataHeaderData|Data....
IndexTupleHeader|DataHeaderData|Data....
This has a problem with alignment, but the visibility or index data could be placed into
the line item pointer (IIRC somebody suggested that as a vacuum improvement). And the
HeapTupleData structure should be extended (roughly as sketched below):
t_data - pointer to DataHeaderData
t_type - type of data header
t_header - pointer to Visibility/Datum/Index header
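Just a sketch (the enum and the three new fields are only illustrative names;
the first three fields are what HeapTupleData already has today):

typedef enum TupleHeaderType
{
    TUPLE_HEADER_VISIBILITY,
    TUPLE_HEADER_DATUM,
    TUPLE_HEADER_INDEX
} TupleHeaderType;

typedef struct HeapTupleData
{
    uint32           t_len;         /* length of *t_data */
    ItemPointerData  t_self;        /* SelfItemPointer */
    Oid              t_tableOid;    /* table the tuple came from */
    DataHeaderData  *t_data;        /* pointer to the common data header */
    TupleHeaderType  t_type;        /* which specific header accompanies it */
    void            *t_header;      /* pointer to Visibility/Datum/Index header */
} HeapTupleData;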
The main idea behind this is to have a stable, general and minimalistic DataHeader
structure.
It is just an idea, without deep examination. It also seems to me a good way to
reduce the memory footprint, but maybe I'm wrong.
Zdenek
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
I'm not sure but I think we can store composite types without typid and typmod
No, we can't. At least, tuple header structure is not the reason why not.
regards, tom lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Hmm, you're right. I think it can be made to work by storing the *end*
offset of each chunk. To find the chunk containing offset X, search for
the first chunk with end_offset > X.
Yeah, that seems like it would work, and it would disentangle us
altogether from needing a hard-wired chunk size. The only downside is
that it'd be a pain to convert in-place. However, if we are also going
to add identifying information to the toast chunks (like the owning
column's number or datatype), then you could tell whether a toast chunk
had been converted by checking t_natts. So in principle a toast table
could be converted a page at a time. If the converted data didn't fit
you could push one of the chunks out to some new page of the file.
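Just to sketch that check (hypothetical code, assuming the usual
HeapTupleHeaderGetNatts() macro from access/htup.h and an identity column that
takes the attribute count from 3 to 4):

#include "postgres.h"
#include "access/htup.h"

/*
 * Old-format chunks have 3 attributes (chunk_id, chunk_seq, chunk_data);
 * with an identity column added, converted chunks have 4, so t_natts
 * doubles as a format marker.
 */
#define OLD_TOAST_NATTS     3
#define NEW_TOAST_NATTS     4

static bool
toast_chunk_is_converted(HeapTupleHeader chunktup)
{
    return HeapTupleHeaderGetNatts(chunktup) >= NEW_TOAST_NATTS;
}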
On the whole I like this a lot better than Zdenek's original proposal
http://archives.postgresql.org/pgsql-hackers/2008-10/msg00556.php
which didn't seem to me to solve much of anything.
regards, tom lane
Tom Lane napsal(a):
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Hmm, you're right. I think it can be made to work by storing the *end*
offset of each chunk. To find the chunk containing offset X, search for
the first chunk with end_offset > X.
Yeah, that seems like it would work, and it would disentangle us
altogether from needing a hard-wired chunk size. The only downside is
that it'd be a pain to convert in-place. However, if we are also going
to add identifying information to the toast chunks (like the owning
column's number or datatype), then you could tell whether a toast chunk
had been converted by checking t_natts. So in principle a toast table
could be converted a page at a time. If the converted data didn't fit
you could push one of the chunks out to some new page of the file.
Yeah, that was the main intention. The problem is the toast index, but it is a
common problem, not only for toast tables.
On the whole I like this a lot better than Zdenek's original proposal
http://archives.postgresql.org/pgsql-hackers/2008-10/msg00556.php
which didn't seem to me to solve much of anything.
Agreed. This approach is much better. It adds more complexity now for converting
chunks from the old to the new version, but it adds a benefit: for example, vacuum
can remove data from dropped columns, and so on.
Zdenek
--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql
Heikki Linnakangas wrote:
Hmm, you're right. I think it can be made to work by storing the *end*
offset of each chunk. To find the chunk containing offset X, search for
the first chunk with end_offset > X.
FWIW I'm trying to do this. So far I've managed to make the basic thing
work, and I'm about to have a look at the slice interface.
(Quick note so that nobody wastes their time doing the same thing)
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera napsal(a):
Heikki Linnakangas wrote:
Hmm, you're right. I think it can be made to work by storing the *end*
offset of each chunk. To find the chunk containing offset X, search for
the first chunk with end_offset > X.
FWIW I'm trying to do this. So far I've managed to make the basic thing
work, and I'm about to have a look at the slice interface.
(Quick note so that nobody wastes their time doing the same thing)
Thanks. I'm now busy with space reservation development, and it really helps to
have everything ready in time.
Thanks,
Zdenek
Zdenek Kotala wrote:
Alvaro Herrera napsal(a):
Heikki Linnakangas wrote:
Hmm, you're right. I think it can be made to work by storing the
*end* offset of each chunk. To find the chunk containing offset X,
search for the first chunk with end_offset > X.
FWIW I'm trying to do this. So far I've managed to make the basic thing
work, and I'm about to have a look at the slice interface.
Okay, so this seems to work. It's missing the sanity checks on
the returned data, and a look at the SGML docs to see if anything needs
updating. I'm also going to recheck code comments that may need
updates.
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Attachments:
toast-chunkend.patch (text/x-diff; charset=us-ascii)
Index: src/backend/access/heap/tuptoaster.c
===================================================================
RCS file: /home/alvherre/Code/cvs/pgsql/src/backend/access/heap/tuptoaster.c,v
retrieving revision 1.91
diff -c -p -r1.91 tuptoaster.c
*** src/backend/access/heap/tuptoaster.c 6 Nov 2008 20:51:14 -0000 1.91
--- src/backend/access/heap/tuptoaster.c 17 Nov 2008 20:49:41 -0000
*************** toast_save_datum(Relation rel, Datum val
*** 1134,1140 ****
int32 align_it; /* ensure struct is aligned well enough */
} chunk_data;
int32 chunk_size;
! int32 chunk_seq = 0;
char *data_p;
int32 data_todo;
Pointer dval = DatumGetPointer(value);
--- 1134,1140 ----
int32 align_it; /* ensure struct is aligned well enough */
} chunk_data;
int32 chunk_size;
! int32 data_done = 0;
char *data_p;
int32 data_todo;
Pointer dval = DatumGetPointer(value);
*************** toast_save_datum(Relation rel, Datum val
*** 1208,1214 ****
/*
* Build a tuple and store it
*/
! t_values[1] = Int32GetDatum(chunk_seq++);
SET_VARSIZE(&chunk_data, chunk_size + VARHDRSZ);
memcpy(VARDATA(&chunk_data), data_p, chunk_size);
toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
--- 1208,1214 ----
/*
* Build a tuple and store it
*/
! t_values[1] = Int32GetDatum(data_done + chunk_size);
SET_VARSIZE(&chunk_data, chunk_size + VARHDRSZ);
memcpy(VARDATA(&chunk_data), data_p, chunk_size);
toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
*************** toast_save_datum(Relation rel, Datum val
*** 1237,1242 ****
--- 1237,1243 ----
*/
data_todo -= chunk_size;
data_p += chunk_size;
+ data_done += chunk_size;
}
/*
*************** toast_fetch_datum(struct varlena * attr)
*** 1336,1343 ****
struct varlena *result;
struct varatt_external toast_pointer;
int32 ressize;
! int32 residx,
! nextidx;
int32 numchunks;
Pointer chunk;
bool isnull;
--- 1337,1344 ----
struct varlena *result;
struct varatt_external toast_pointer;
int32 ressize;
! int32 endoff,
! prevend;
int32 numchunks;
Pointer chunk;
bool isnull;
*************** toast_fetch_datum(struct varlena * attr)
*** 1373,1394 ****
ObjectIdGetDatum(toast_pointer.va_valueid));
/*
! * Read the chunks by index
*
! * Note that because the index is actually on (valueid, chunkidx) we will
! * see the chunks in chunkidx order, even though we didn't explicitly ask
* for it.
*/
! nextidx = 0;
toastscan = systable_beginscan_ordered(toastrel, toastidx,
SnapshotToast, 1, &toastkey);
while ((ttup = systable_getnext_ordered(toastscan, ForwardScanDirection)) != NULL)
{
/*
! * Have a chunk, extract the sequence number and the data
*/
! residx = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
Assert(!isnull);
chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
Assert(!isnull);
--- 1374,1395 ----
ObjectIdGetDatum(toast_pointer.va_valueid));
/*
! * Read the chunks by chunk end position
*
! * Note that because the index is actually on (valueid, chunk-end) we will
! * see the chunks in chunk-end order, even though we didn't explicitly ask
* for it.
*/
! prevend = 0;
toastscan = systable_beginscan_ordered(toastrel, toastidx,
SnapshotToast, 1, &toastkey);
while ((ttup = systable_getnext_ordered(toastscan, ForwardScanDirection)) != NULL)
{
/*
! * Have a chunk, extract its end offset and the data
*/
! endoff = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
Assert(!isnull);
chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
Assert(!isnull);
*************** toast_fetch_datum(struct varlena * attr)
*** 1416,1472 ****
/*
* Some checks on the data we've found
*/
! if (residx != nextidx)
! elog(ERROR, "unexpected chunk number %d (expected %d) for toast value %u in %s",
! residx, nextidx,
! toast_pointer.va_valueid,
! RelationGetRelationName(toastrel));
! if (residx < numchunks - 1)
! {
! if (chunksize != TOAST_MAX_CHUNK_SIZE)
! elog(ERROR, "unexpected chunk size %d (expected %d) in chunk %d of %d for toast value %u in %s",
! chunksize, (int) TOAST_MAX_CHUNK_SIZE,
! residx, numchunks,
! toast_pointer.va_valueid,
! RelationGetRelationName(toastrel));
! }
! else if (residx == numchunks - 1)
! {
! if ((residx * TOAST_MAX_CHUNK_SIZE + chunksize) != ressize)
! elog(ERROR, "unexpected chunk size %d (expected %d) in final chunk %d for toast value %u in %s",
! chunksize,
! (int) (ressize - residx * TOAST_MAX_CHUNK_SIZE),
! residx,
! toast_pointer.va_valueid,
! RelationGetRelationName(toastrel));
! }
! else
! elog(ERROR, "unexpected chunk number %d (out of range %d..%d) for toast value %u in %s",
! residx,
! 0, numchunks - 1,
toast_pointer.va_valueid,
RelationGetRelationName(toastrel));
/*
* Copy the data into proper place in our result
*/
! memcpy(VARDATA(result) + residx * TOAST_MAX_CHUNK_SIZE,
chunkdata,
chunksize);
! nextidx++;
}
/*
- * Final checks that we successfully fetched the datum
- */
- if (nextidx != numchunks)
- elog(ERROR, "missing chunk number %d for toast value %u in %s",
- nextidx,
- toast_pointer.va_valueid,
- RelationGetRelationName(toastrel));
-
- /*
* End scan and close relations
*/
systable_endscan_ordered(toastscan);
--- 1417,1439 ----
/*
* Some checks on the data we've found
*/
! if (endoff <= prevend)
! elog(ERROR, "unexpected chunk end position %d (expected > %d) for toast value %u in %s",
! endoff, prevend,
toast_pointer.va_valueid,
RelationGetRelationName(toastrel));
/*
* Copy the data into proper place in our result
*/
! memcpy(VARDATA(result) + prevend,
chunkdata,
chunksize);
! prevend = endoff;
}
/*
* End scan and close relations
*/
systable_endscan_ordered(toastscan);
*************** toast_fetch_datum_slice(struct varlena *
*** 1488,1515 ****
{
Relation toastrel;
Relation toastidx;
! ScanKeyData toastkey[3];
! int nscankeys;
SysScanDesc toastscan;
HeapTuple ttup;
TupleDesc toasttupDesc;
struct varlena *result;
struct varatt_external toast_pointer;
int32 attrsize;
! int32 residx;
! int32 nextidx;
! int numchunks;
! int startchunk;
! int endchunk;
! int32 startoffset;
! int32 endoffset;
! int totalchunks;
! Pointer chunk;
! bool isnull;
! char *chunkdata;
! int32 chunksize;
! int32 chcpystrt;
! int32 chcpyend;
Assert(VARATT_IS_EXTERNAL(attr));
--- 1455,1468 ----
{
Relation toastrel;
Relation toastidx;
! ScanKeyData toastkey[2];
SysScanDesc toastscan;
HeapTuple ttup;
TupleDesc toasttupDesc;
struct varlena *result;
struct varatt_external toast_pointer;
int32 attrsize;
! int32 dstoffset;
Assert(VARATT_IS_EXTERNAL(attr));
*************** toast_fetch_datum_slice(struct varlena *
*** 1523,1529 ****
Assert(!VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer));
attrsize = toast_pointer.va_extsize;
- totalchunks = ((attrsize - 1) / TOAST_MAX_CHUNK_SIZE) + 1;
if (sliceoffset >= attrsize)
{
--- 1476,1481 ----
*************** toast_fetch_datum_slice(struct varlena *
*** 1544,1610 ****
if (length == 0)
return result; /* Can save a lot of work at this point! */
- startchunk = sliceoffset / TOAST_MAX_CHUNK_SIZE;
- endchunk = (sliceoffset + length - 1) / TOAST_MAX_CHUNK_SIZE;
- numchunks = (endchunk - startchunk) + 1;
-
- startoffset = sliceoffset % TOAST_MAX_CHUNK_SIZE;
- endoffset = (sliceoffset + length - 1) % TOAST_MAX_CHUNK_SIZE;
-
/*
* Open the toast relation and its index
*/
toastrel = heap_open(toast_pointer.va_toastrelid, AccessShareLock);
! toasttupDesc = toastrel->rd_att;
toastidx = index_open(toastrel->rd_rel->reltoastidxid, AccessShareLock);
/*
! * Setup a scan key to fetch from the index. This is either two keys or
! * three depending on the number of chunks.
*/
ScanKeyInit(&toastkey[0],
(AttrNumber) 1,
BTEqualStrategyNumber, F_OIDEQ,
ObjectIdGetDatum(toast_pointer.va_valueid));
/*
! * Use equality condition for one chunk, a range condition otherwise:
! */
! if (numchunks == 1)
! {
! ScanKeyInit(&toastkey[1],
! (AttrNumber) 2,
! BTEqualStrategyNumber, F_INT4EQ,
! Int32GetDatum(startchunk));
! nscankeys = 2;
! }
! else
! {
! ScanKeyInit(&toastkey[1],
! (AttrNumber) 2,
! BTGreaterEqualStrategyNumber, F_INT4GE,
! Int32GetDatum(startchunk));
! ScanKeyInit(&toastkey[2],
! (AttrNumber) 2,
! BTLessEqualStrategyNumber, F_INT4LE,
! Int32GetDatum(endchunk));
! nscankeys = 3;
! }
!
! /*
! * Read the chunks by index
*
! * The index is on (valueid, chunkidx) so they will come in order
*/
! nextidx = startchunk;
toastscan = systable_beginscan_ordered(toastrel, toastidx,
! SnapshotToast, nscankeys, toastkey);
while ((ttup = systable_getnext_ordered(toastscan, ForwardScanDirection)) != NULL)
{
/*
! * Have a chunk, extract the sequence number and the data
*/
! residx = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
Assert(!isnull);
chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
Assert(!isnull);
--- 1496,1542 ----
if (length == 0)
return result; /* Can save a lot of work at this point! */
/*
* Open the toast relation and its index
*/
toastrel = heap_open(toast_pointer.va_toastrelid, AccessShareLock);
! toasttupDesc = RelationGetDescr(toastrel);
toastidx = index_open(toastrel->rd_rel->reltoastidxid, AccessShareLock);
/*
! * Setup a scan key to fetch from the index.
*/
ScanKeyInit(&toastkey[0],
(AttrNumber) 1,
BTEqualStrategyNumber, F_OIDEQ,
ObjectIdGetDatum(toast_pointer.va_valueid));
+ ScanKeyInit(&toastkey[1],
+ (AttrNumber) 2,
+ BTGreaterStrategyNumber, F_INT4GT,
+ Int32GetDatum(sliceoffset));
/*
! * Read the chunks by end offset
*
! * The index is on (valueid, chunk-end) so they will come in order
*/
! dstoffset = 0;
toastscan = systable_beginscan_ordered(toastrel, toastidx,
! SnapshotToast, 2, toastkey);
while ((ttup = systable_getnext_ordered(toastscan, ForwardScanDirection)) != NULL)
{
+ uint32 srcstart;
+ uint32 chunkend;
+ uint32 copylength;
+ Pointer chunk;
+ bool isnull;
+ char *chunkdata;
+ int32 chunksize;
+
/*
! * Have a chunk, extract the end offset and the data
*/
! chunkend = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
Assert(!isnull);
chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
Assert(!isnull);
*************** toast_fetch_datum_slice(struct varlena *
*** 1629,1694 ****
chunkdata = NULL;
}
/*
* Some checks on the data we've found
*/
! if ((residx != nextidx) || (residx > endchunk) || (residx < startchunk))
! elog(ERROR, "unexpected chunk number %d (expected %d) for toast value %u in %s",
! residx, nextidx,
! toast_pointer.va_valueid,
! RelationGetRelationName(toastrel));
! if (residx < totalchunks - 1)
! {
! if (chunksize != TOAST_MAX_CHUNK_SIZE)
! elog(ERROR, "unexpected chunk size %d (expected %d) in chunk %d of %d for toast value %u in %s when fetching slice",
! chunksize, (int) TOAST_MAX_CHUNK_SIZE,
! residx, totalchunks,
! toast_pointer.va_valueid,
! RelationGetRelationName(toastrel));
! }
! else if (residx == totalchunks - 1)
! {
! if ((residx * TOAST_MAX_CHUNK_SIZE + chunksize) != attrsize)
! elog(ERROR, "unexpected chunk size %d (expected %d) in final chunk %d for toast value %u in %s when fetching slice",
! chunksize,
! (int) (attrsize - residx * TOAST_MAX_CHUNK_SIZE),
! residx,
! toast_pointer.va_valueid,
! RelationGetRelationName(toastrel));
! }
! else
! elog(ERROR, "unexpected chunk number %d (out of range %d..%d) for toast value %u in %s",
! residx,
! 0, totalchunks - 1,
! toast_pointer.va_valueid,
! RelationGetRelationName(toastrel));
/*
* Copy the data into proper place in our result
*/
! chcpystrt = 0;
! chcpyend = chunksize - 1;
! if (residx == startchunk)
! chcpystrt = startoffset;
! if (residx == endchunk)
! chcpyend = endoffset;
!
! memcpy(VARDATA(result) +
! (residx * TOAST_MAX_CHUNK_SIZE - sliceoffset) + chcpystrt,
! chunkdata + chcpystrt,
! (chcpyend - chcpystrt) + 1);
! nextidx++;
}
/*
* Final checks that we successfully fetched the datum
*/
! if (nextidx != (endchunk + 1))
! elog(ERROR, "missing chunk number %d for toast value %u in %s",
! nextidx,
! toast_pointer.va_valueid,
! RelationGetRelationName(toastrel));
/*
* End scan and close relations
--- 1561,1598 ----
chunkdata = NULL;
}
+ #if 0
/*
* Some checks on the data we've found
*/
! #endif
/*
* Copy the data into proper place in our result
*/
! if (dstoffset == 0) /* first chunk; skip unneeded bytes */
! srcstart = sliceoffset - chunkend + chunksize;
! else
! srcstart = 0;
!
! copylength = Min(chunksize - srcstart, length);
!
! memcpy(VARDATA(result) + dstoffset,
! chunkdata + srcstart,
! copylength);
!
! length -= copylength;
! dstoffset += copylength;
! if (length == 0)
! break;
}
+ #if 0
/*
* Final checks that we successfully fetched the datum
*/
! #endif
/*
* End scan and close relations
Index: src/backend/catalog/toasting.c
===================================================================
RCS file: /home/alvherre/Code/cvs/pgsql/src/backend/catalog/toasting.c,v
retrieving revision 1.11
diff -c -p -r1.11 toasting.c
*** src/backend/catalog/toasting.c 1 Sep 2008 20:42:43 -0000 1.11
--- src/backend/catalog/toasting.c 14 Nov 2008 23:42:02 -0000
*************** create_toast_table(Relation rel, Oid toa
*** 157,163 ****
OIDOID,
-1, 0);
TupleDescInitEntry(tupdesc, (AttrNumber) 2,
! "chunk_seq",
INT4OID,
-1, 0);
TupleDescInitEntry(tupdesc, (AttrNumber) 3,
--- 157,163 ----
OIDOID,
-1, 0);
TupleDescInitEntry(tupdesc, (AttrNumber) 2,
! "chunk_end",
INT4OID,
-1, 0);
TupleDescInitEntry(tupdesc, (AttrNumber) 3,
Alvaro Herrera napsal(a):
Zdenek Kotala wrote:
Alvaro Herrera napsal(a):
Heikki Linnakangas wrote:
Hmm, you're right. I think it can be made to work by storing the
*end* offset of each chunk. To find the chunk containing offset X,
search for the first chunk with end_offset > X.
FWIW I'm trying to do this. So far I've managed to make the basic thing
work, and I'm about to have a look at the slice interface.
Okay, so this seems to work. It's missing the sanity checks on
the returned data, and a look at the SGML docs to see if anything needs
updating. I'm also going to recheck code comments that may need
updates.
Hi Alvaro,
Just a very quick look at your patch. See my comments:
1) TOAST_MAX_CHUNK_SIZE should be removed from the controldata structure.
2) PG_PAGE_LAYOUT_VERSION should be bumped.
3) the other main idea of the toast redesign has been to add column-number information to
each chunk.
Thinking about it more, it solves one problem but adds another - index
updates when the page layout is converted during a read. And there are other issues
we need to solve - I will send a new mail.
Zdenek
Zdenek Kotala wrote:
Just a very quick look at your patch. See my comments:
...
2) PG_PAGE_LAYOUT_VERSION should be bumped.
The patch doesn't change the page layout AFAICS.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas napsal(a):
Zdenek Kotala wrote:
Just a very quick look at your patch. See my comments:
...
2) PG_PAGE_LAYOUT_VERSION should be bumped.
The patch doesn't change the page layout AFAICS.
It is a good question what is and what is not page layout. I think that the toast
implementation is part of the page layout. OK, it is called page layout, but a
better name would be On-Disk Format (ODF). You will not be able to read an 8.3
toasted table in 8.4.
Zdenek Kotala wrote:
Heikki Linnakangas napsal(a):
Zdenek Kotala wrote:
Just a very quick look at your patch. See my comments:
...
2) PG_PAGE_LAYOUT_VERSION should be bumped.
The patch doesn't change the page layout AFAICS.
It is a good question what is and what is not page layout. I think that
the toast implementation is part of the page layout. OK, it is called page
layout, but a better name would be On-Disk Format (ODF). You will not be able
to read an 8.3 toasted table in 8.4.
It's clearly just a catalog change; the number and meaning of attributes
has changed, and that's reflected in CATALOG_VERSION_NO.
We need to be pragmatic, though, and think about how the conversion
would work, and if the version number change would help or hurt that
process. I'm not clear how we would handle the toast table change. If
we're going to handle it by retoasting all attributes when the main heap
page is read in, then I suppose we'd actually change the version number
of the *heap* page, not toast table pages, when the heap page is
retoasted. However, if you want to do it toast-page at a time, or
toast-tuple at a time, you can just look at the number of attributes on
the toast tuple to determine which format it's in.
Note that bumping the version number is not free. We haven't made any
changes in 8.4 this far that would require bumping it. If we do bump it,
the next version with online-upgrade support will need to deal with it,
if only to increment and write back the page.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas napsal(a):
Zdenek Kotala wrote:
Heikki Linnakangas napsal(a):
Zdenek Kotala wrote:
Just a very quick look at your patch. See my comments:
...
2) PG_PAGE_LAYOUT_VERSION should be bumped.
The patch doesn't change the page layout AFAICS.
It is a good question what is and what is not page layout. I think that
the toast implementation is part of the page layout. OK, it is called page
layout, but a better name would be On-Disk Format (ODF). You will not
be able to read an 8.3 toasted table in 8.4.
It's clearly just a catalog change; the number and meaning of attributes
has changed, and that's reflected in CATALOG_VERSION_NO.
In my opinion it is not only a catalog change. You are probably right that it is
not part of the page layout version. However, it changes the column meaning in data
tables. You need to convert the whole toast table and reindex the toast table's index.
That is something you cannot do online - or you can, but you need an exclusive lock
on the toast table.
We need to be pragmatic, though, and think about how the conversion
would work, and if the version number change would help or hurt that
process. I'm not clear how we would handle the toast table change. If
we're going to handle it by retoasting all attributes when the main heap
page is read in, then I suppose we'd actually change the version number
of the *heap* page, not toast table pages, when the heap page is
retoasted. However, if you want to do it toast-page at a time, or
toast-tuple at a time, you can just look at the number of attributes on
the toast tuple to determine which format it's in.
I'm trying to write down a toast conversion concept. It looks like
it is more complex than I expected.
Note that bumping the version number is not free. We haven't made any
changes in 8.4 this far that would require bumping it. If we do bump it,
the next version with online-upgrade support will need to deal with it,
if only to increment and write back the page.
Yes, I know about it. But I'm afraid that 8.3->8.4 in-place upgrade will not work.
Zdenek
Heikki Linnakangas napsal(a):
Zdenek Kotala wrote:
Heikki Linnakangas napsal(a):
Zdenek Kotala wrote:
Just a very quick look at your patch. See my comments:
...
2) PG_PAGE_LAYOUT_VERSION should be bumped.
The patch doesn't change the page layout AFAICS.
It is a good question what is and what is not page layout. I think that
the toast implementation is part of the page layout. OK, it is called page
layout, but a better name would be On-Disk Format (ODF). You will not be able
to read an 8.3 toasted table in 8.4.
It's clearly just a catalog change; the number and meaning of attributes
has changed, and that's reflected in CATALOG_VERSION_NO.
Thinking about it more, it is probably not a CATALOG_VERSION_NO issue either, because
the toast table is created on demand; it is not in the BKI.
Maybe we should add something like TOAST_VERSION.
Do we bump the catalog version when an AM bumps its version?
Zdenek
Zdenek Kotala wrote:
Thinking about it more, it is probably not a CATALOG_VERSION_NO issue either,
because the toast table is created on demand; it is not in the BKI.
It's not catversion in the sense that there's no catalog change, but it
certainly requires a catversion bump due to internal changes.
Otherwise, developers who have working data directories today will see
weird errors when they update to a CVS version after this commit.
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
Zdenek Kotala wrote:
Thinking about it more, it is probably not a CATALOG_VERSION_NO issue either,
because the toast table is created on demand; it is not in the BKI.
It's not catversion in the sense that there's no catalog change, but it
certainly requires a catversion bump due to internal changes.
Otherwise, developers who have working data directories today will see
weird errors when they update to a CVS version after this commit.
Yes. The real purpose of catversion is to keep developers from wasting
time using an incompatible data directory.
As far as the point at hand goes: the original discussion about this
assumed that we'd add at least one "identity" column to toast tables,
which would allow the t_natts of a toast tuple to effectively serve
as a version number. So that fixes the problem of how to know what
you are looking at. What it doesn't solve is the problem of how to
know what range of index values to search for in a partial-fetch
operation. If you just scan what would be the expected range of
converted chunk positions, you might miss all the old-format entries.
Anyone have a clue on that?
regards, tom lane
Alvaro Herrera napsal(a):
Zdenek Kotala wrote:
Thinking about it more, it is probably not a CATALOG_VERSION_NO issue either,
because the toast table is created on demand; it is not in the BKI.
It's not catversion in the sense that there's no catalog change, but it
certainly requires a catversion bump due to internal changes.
Otherwise, developers who have working data directories today will see
weird errors when they update to a CVS version after this commit.
I understand that, but from the upgrade point of view it is confusing. When you upgrade
the catalog, the catalog will not correspond with the toast table structure, and
there is no clue whether a toast table is or is not already converted, or which
toast table structure is used.
Zdenek