Arbitrary tuple size
Well,
doing arbitrary tuple size should be as generic as possible.
Thus I think the best place to do it is down in the heapam
routines (heap_fetch(), heap_getnext(), heap_insert(), ...).
I'm not 100% sure, but nothing should access a heap relation
by going around them. Anyway, if there are such places, then it's
time to clean them up.
What about adding one more ItemPointerData to the tuple
header which holds the ctid of a DATA continuation tuple. If
a tuple doesn't fit into one block, this will tell where to
get the next chunk of tuple data building a chain until an
invalid ctid is found. The continuation tuples can have a
negative t_natts to be easily identified and ignored by
scanning routines.
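The chain walk described above could be sketched roughly as follows. This is only a hedged sketch against simplified stand-in structures (ChunkTuple, reassemble_tuple, and the integer stand-in for ItemPointerData are invented for illustration, not real heapam types): the reader follows the continuation ctid from chunk to chunk until an invalid ctid terminates the chain, gluing the data pieces back into one complete tuple.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INVALID_CTID (-1)   /* stand-in for an invalid ItemPointer */

/* Simplified stand-in for a heap tuple chunk; a real implementation
 * would use ItemPointerData and the on-disk tuple header. */
typedef struct ChunkTuple
{
    int         t_cont_ctid;    /* "ctid" of the next DATA continuation chunk */
    int         len;            /* bytes of user data in this chunk */
    const char *data;           /* this chunk's slice of the tuple data */
} ChunkTuple;

/* Walk the continuation chain starting at 'first', concatenating the
 * data slices into one freshly allocated buffer (palloc in the real
 * backend, malloc here).  Total length is returned via *total_len. */
static char *
reassemble_tuple(const ChunkTuple *chunks, int first, int *total_len)
{
    int     len = 0, ctid;
    char   *result;

    /* first pass: find the total length of the chain */
    for (ctid = first; ctid != INVALID_CTID; ctid = chunks[ctid].t_cont_ctid)
        len += chunks[ctid].len;

    result = malloc(len + 1);

    /* second pass: copy each chunk's slice into place */
    len = 0;
    for (ctid = first; ctid != INVALID_CTID; ctid = chunks[ctid].t_cont_ctid)
    {
        memcpy(result + len, chunks[ctid].data, chunks[ctid].len);
        len += chunks[ctid].len;
    }
    result[len] = '\0';
    *total_len = len;
    return result;
}
```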
By doing it this way we could also squeeze out some currently
wasted space. All tuples that get inserted/updated are added
to the end of the relation. If a tuple currently doesn't fit
into the freespace of the actual last block, that freespace
is wasted and the tuple is placed into a newly allocated block
at the end. So if there is 5K freespace and another 5.5K
tuple is added, the relation grows effectively by 10.5K!
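In round numbers, that accounting can be checked with a tiny sketch (the helper names are invented for illustration, and per-chunk header overhead is ignored):

```c
#include <assert.h>

/* Space effectively consumed by adding one tuple when tuples must fit
 * in a single block: if the tuple doesn't fit into the last block's
 * freespace, that freespace is wasted on top of the tuple itself. */
static int
space_consumed_without_split(int freespace, int tuplesz)
{
    return (tuplesz <= freespace) ? tuplesz : freespace + tuplesz;
}

/* With continuation chunks the freespace is filled first, so only the
 * tuple's own bytes are consumed (chunk header overhead ignored). */
static int
space_consumed_with_split(int freespace, int tuplesz)
{
    (void) freespace;           /* never wasted */
    return tuplesz;
}
```

For the example above (5K freespace, 5.5K tuple) the first helper yields 10.5K, the second only 5.5K.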
I'm not sure how to handle this with vacuum, but I believe
Vadim is able to put some well placed goto's that make it.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#========================================= wieck@debis.com (Jan Wieck) #
Jan Wieck wrote:
What about adding one more ItemPointerData to the tuple
header which holds the ctid of a DATA continuation tuple. If
Oh no. Fortunately we don't need that: we can just add a new flag
to t_infomask and add the continuation tid at the end of the tuple chunk.
Ok?
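Vadim's variant could look like the following sketch. All of it is hypothetical: HEAP_HASCONT is not an actual t_infomask bit from the source tree (the free bits would have to be checked), and DemoTupleHeader merely stands in for the real tuple header. The point is that the continuation tid lives in the chunk's trailing bytes, flagged by t_infomask, so the header itself does not grow.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flag bit -- NOT a real t_infomask value; which bits
 * are actually free would have to be checked in the source. */
#define HEAP_HASCONT 0x8000

/* Stand-in for the on-disk tuple header. */
typedef struct DemoTupleHeader
{
    unsigned short t_infomask;
    int            datalen;
    /* data follows; if HEAP_HASCONT is set, the last bytes of the
     * chunk hold the continuation tid instead of user data */
} DemoTupleHeader;

/* Extract the continuation tid stored at the end of the chunk,
 * or -1 when the tuple is complete. */
static int
get_continuation_tid(const DemoTupleHeader *hdr, const char *chunk)
{
    int tid;

    if (!(hdr->t_infomask & HEAP_HASCONT))
        return -1;
    memcpy(&tid, chunk + hdr->datalen - sizeof(tid), sizeof(tid));
    return tid;
}
```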
a tuple doesn't fit into one block, this will tell where to
get the next chunk of tuple data building a chain until an
invalid ctid is found. The continuation tuples can have a
negative t_natts to be easily identified and ignored by
scanning routines.
...
I'm not sure how to handle this with vacuum, but I believe
Vadim is able to put some well placed goto's that make it.
-:)))
Ok, ok - I have a great number of goto-s in my pocket -:)
Vadim
Well,
doing arbitrary tuple size should be as generic as possible.
Thus I think the best place to do it is down in the heapam
routines (heap_fetch(), heap_getnext(), heap_insert(), ...).
I'm not 100% sure but nothing should access a heap relation
going around them. Anyway, if there are places, then it's
time to clean them up.
What about adding one more ItemPointerData to the tuple
header which holds the ctid of a DATA continuation tuple. If
a tuple doesn't fit into one block, this will tell where to
get the next chunk of tuple data building a chain until an
invalid ctid is found. The continuation tuples can have a
negative t_natts to be easily identified and ignored by
scanning routines.
By doing it this way we could also squeeze out some currently
wasted space. All tuples that get inserted/updated are added
to the end of the relation. If a tuple currently doesn't fit
into the freespace of the actual last block, that freespace
is wasted and the tuple is placed into a new allocated block
at the end. So if there is 5K freespace and another 5.5K
tuple is added, the relation grows effectively by 10.5K!
I'm not sure how to handle this with vacuum, but I believe
Vadim is able to put some well placed goto's that make it.
I agree this is the way to go. There is nothing I can think of that is
limited to how large a tuple can be. It is just accessing it from the
heap routines that is the problem. If the tuple is alloc'ed to be used,
we can paste together the parts on disk and return one tuple. If they
are accessing the buffer copy directly, we would have to be smart about
going off the end of the disk copy and moving to the next segment.
The code is very clear now about accessing tuples or tuple copies.
Hopefully locking will not be an issue because you only need to lock the
main tuple. No one is going to see the secondary part of the tuple.
If Vadim can do MVCC, he certainly can handle this, with the help of
goto. :-)
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Jan Wieck wrote:
What about adding one more ItemPointerData to the tuple
header which holds the ctid of a DATA continuation tuple. If
Oh no. Fortunately we don't need that: we can just add a new flag
to t_infomask and add the continuation tid at the end of the tuple chunk.
Ok?
Sounds good. I would rather not add stuff to the tuple header if we can
prevent it.
I'm not sure how to handle this with vacuum, but I believe
Vadim is able to put some well placed goto's that make it.
-:)))
Ok, ok - I have a great number of goto-s in my pocket -:)
I can send you more.
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote:
-:)))
Ok, ok - I have a great number of goto-s in my pocket -:)
I can send you more.
I have some cheap, spare longjmp()'s over here - anyone need
them? :-)
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#========================================= wieck@debis.com (Jan Wieck) #
Bruce Momjian wrote:
What about adding one more ItemPointerData to the tuple
header which holds the ctid of a DATA continuation tuple. If
a tuple doesn't fit into one block, this will tell where to
get the next chunk of tuple data building a chain until an
invalid ctid is found. The continuation tuples can have a
negative t_natts to be easily identified and ignored by
scanning routines.
Yes, Vadim, putting a flag into the bits already there to
signal it is much better. The information that a particular
tuple is an extension tuple should also go there instead of
misusing t_natts.
I agree this is the way to go. There is nothing I can think of that is
limited to how large a tuple can be. It is just accessing it from the
heap routines that is the problem. If the tuple is alloc'ed to be used,
we can paste together the parts on disk and return one tuple. If they
are accessing the buffer copy directly, we would have to be smart about
going off the end of the disk copy and moving to the next segment.
Who's accessing tuple attributes directly inside the buffer
copy (not only the header, which will still be unsplit at
the top of the chain)?
Aren't these situations where it is done restricted to system
catalogs? I think we can live with the restriction that the
tuple split will not be available for system relations
because the only place where the limit hit us is pg_rewrite
and that can be handled by redesigning the storage of rules
which is already required by the rule recompilation TODO.
I can't think of anywhere in the code where a buffer from a user
relation (except for sequences, and that's another story) is
accessed that clumsily.
The code is very clear now about accessing tuples or tuple copies.
Hopefully locking will not be an issue because you only need to lock the
main tuple. No one is going to see the secondary part of the tuple.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#========================================= wieck@debis.com (Jan Wieck) #
Bruce Momjian wrote:
I agree this is the way to go. There is nothing I can think of that is
limited to how large a tuple can be.
Ouch - I can.
An index on a varlen field whose value no longer fits
into an index block. Wouldn't this cause problems? Well,
it's bad database design to index fields that will receive
such long data, because indexing them will blow up the
database, but it must work anyway.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#========================================= wieck@debis.com (Jan Wieck) #
wieck@debis.com (Jan Wieck) writes:
Bruce Momjian wrote:
I agree this is the way to go. There is nothing I can think of that is
limited to how large a tuple can be.
Ouch - I can.
Having an index on a varlen field that now doesn't fit any
more into an index block. Wouldn't this cause problems?
Aren't index tuples still tuples? Can't they be split just like
regular tuples?
regards, tom lane
Tom Lane wrote:
wieck@debis.com (Jan Wieck) writes:
Bruce Momjian wrote:
I agree this is the way to go. There is nothing I can think of that is
limited to how large a tuple can be.
Ouch - I can.
Having an index on a varlen field that now doesn't fit any
more into an index block. Wouldn't this cause problems?
Aren't index tuples still tuples? Can't they be split just like
regular tuples?
Don't know, maybe.
While looking for some places where tuple data might be
accessed directly inside of the buffers I've searched for
WriteBuffer() and friends. These are mostly used in the index
access methods and some other places where I expected them,
so the index AMs at least have to be visited carefully when
implementing the tuple split.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#========================================= wieck@debis.com (Jan Wieck) #
I wrote:
Tom Lane wrote:
Aren't index tuples still tuples? Can't they be split just like
regular tuples?
Don't know, maybe.
Actually we have some problems with indices on text
attributes when the content exceeds HALF of the blocksize:
FATAL 1: btree: failed to add item to the page
It crashes the backend AND seems to corrupt the index! It
looks to me like the btree code needs to be able to store at
least two items in one block, and it fails painfully if it
can't.
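A guard of roughly this shape would turn the crash into a clean error up front. The numbers are only illustrative: PAGE_HEADER_SZ stands in for the real page header size, and the actual limit in the btree code would have to account for item headers and special space too.

```c
#include <assert.h>

#define BLCKSZ 8192
#define PAGE_HEADER_SZ 24   /* illustrative, not the real header size */

/* A page must be able to hold at least two index items, or a page
 * split can never succeed; reject oversized items before inserting
 * instead of failing (and corrupting the index) deep inside btree. */
static int
btitem_fits(int itemsz)
{
    return itemsz <= (BLCKSZ - PAGE_HEADER_SZ) / 2;
}
```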
And just another one:
pgsql=> create table t1 (a int4, b char(4000));
CREATE
pgsql=> create index t1_b on t1 (b);
CREATE
pgsql=> insert into t1 values (1, 'a');
TRAP: Failed Assertion("!(( itid)->lp_flags & 0x01):",
File: "nbtinsert.c", Line: 361)
Bruce: One more TODO item!
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#========================================= wieck@debis.com (Jan Wieck) #
Going toward >8k tuples would be really good, but I suspect we may have
some difficulties with the LO stuff once we implement it. Also it seems
that it's not worth adapting LOs to the newly designed tuples. I think
the design of the current LOs is so broken that we need to redesign them.
o it's slow: accessing a LO needs an open() that is not cheap. creating
many LOs makes the data/base/DBNAME/ directory fat.
o it consumes lots of i-nodes
o it breaks the tuple abstraction: this makes the code difficult to
maintain.
I would propose the following for the new version of LO:
o create a new data type that represents the LO
o when defining the LO data type in a table, it actually points to a
LO "body" in another place where it is physically stored.
o the storage for LO bodies would be a hidden table that contains
several LOs, not a single one.
o we can have several tables for the LO bodies. Probably a LO body
table for each corresponding table (where LO data type is defined) is
appropriate.
o it would be nice to place a LO table on a separate
directory/partition from the original table where LO data type is
defined, since a LO body table could become huge.
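Tatsuo's proposal amounts to storing a small fixed-size reference in the user table while the bytes live elsewhere. A minimal sketch, with every name (LargeObjectRef and its fields) invented for illustration:

```c
#include <assert.h>

typedef unsigned int Oid;

/* Hypothetical on-disk representation of the proposed LO data type:
 * the user table stores only this small reference; the bytes live in
 * a hidden per-table LO body table. */
typedef struct LargeObjectRef
{
    Oid     lo_body_rel;   /* which hidden LO body table holds the data */
    Oid     lo_id;         /* identifies the LO within that table */
    int     lo_size;       /* total size in bytes, for quick length checks */
} LargeObjectRef;

/* A sequential scan of the user table touches only the small reference
 * per LO column, never the LO body itself. */
static long
scan_bytes_per_row(int n_lo_columns, long avg_lo_size, int inline_storage)
{
    return inline_storage ? n_lo_columns * avg_lo_size
                          : n_lo_columns * (long) sizeof(LargeObjectRef);
}
```

This also bears on the later scan-time concern in the thread: with out-of-line bodies, scan cost per row stays constant no matter how large the LOs grow.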
Comments? Opinions?
---
Tatsuo Ishii
Jan Wieck wrote:
Bruce Momjian wrote:
I agree this is the way to go. There is nothing I can think of that is
limited to how large a tuple can be.
Ouch - I can.
Having an index on a varlen field that now doesn't fit any
more into an index block. Wouldn't this cause problems? Well
it's bad database design to index fields that will receive
that long data because indexing them will blow up the
database but it must work anyway.
It seems that in other DBMSes the length of an index tuple is more
restricted than that of a heap tuple. So I think we shouldn't worry
about this case.
Vadim
At 10:12 9/07/99 +0900, Tatsuo Ishii wrote:
o create a new data type that represents the LO
o when defining the LO data type in a table, it actually points to a
LO "body" in another place where it is physically stored.
Much as the purist in me hates concept of hard links (as in Leon's suggestions), this *may* be a good application for them. Certainly that's how Dec(Oracle)/Rdb does it. Since most blobs will be totally rewritten when they are updated, this represents a slightly smaller problem in terms of MVCC.
o we can have several tables for the LO bodies. Probably a LO body
table for each corresponding table (where LO data type is defined) is
appropreate.
Did you mean a table for each field? Or a table for each table (which may have more than 1 LO field). See comments below.
o it would be nice to place a LO table on a separate
directory/partition from the original table where LO data type is
defined, since a LO body table could become huge.
I would very much like to see the ability to have multi-file databases and tables - ie. the ability to store a table or index in a separate file. Perhaps with a user-defined partitioning function for table rows. The idea being that:
1. User specifies that a table can be stored in one of (say) three files.
2. When a record is first stored, the partitioning function is called to determine the file 'storage area' to use. [or a random selection method is used]
If you are going to allow LOs to be stored in multiple files, it seems a pity not to add some or all of this feature.
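The partitioning hook in step 2 could be as simple as the following sketch; choose_storage_area and N_STORAGE_AREAS are invented names, and a real implementation would presumably register a user-supplied callback with the table instead of hard-coding one.

```c
#include <assert.h>

#define N_STORAGE_AREAS 3

/* Hypothetical user-defined partitioning function: map a row key to
 * one of the table's configured storage areas (files).  Called once
 * when the record is first stored. */
static int
choose_storage_area(unsigned int row_key)
{
    return (int) (row_key % N_STORAGE_AREAS);
}
```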
Additionally, the issue of pg_dump support for LOs needs to be addressed.
That's about it for me,
Philip Warner.
----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: +61-03-5367 7422 | _________ \
Fax: +61-03-5367 7430 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/
I agree this is the way to go. There is nothing I can think of that is
limited to how large a tuple can be. It is just accessing it from the
heap routines that is the problem. If the tuple is alloc'ed to be used,
we can paste together the parts on disk and return one tuple. If they
are accessing the buffer copy directly, we would have to be smart about
going off the end of the disk copy and moving to the next segment.
Who's accessing tuple attributes directly inside the buffer
copy (not only the header, which will still be unsplit at
the top of the chain)?
Every call to heap_getnext(), for one. It locks the buffer, and returns
a pointer to the tuple. The next heap_getnext(), or heap_endscan()
releases the lock. The cost of returning every tuple as palloc'ed
memory would be huge. We may be able to get away with just returning
palloc'ed stuff for long tuples. That may be a simple, clean solution
that would be isolated.
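That "palloc only for long tuples" idea could be sketched like this. Everything here (DemoTuple, get_tuple_data, the chain-walk placeholder) is a stand-in, not heapam code: the common short-tuple case keeps returning a direct pointer into the locked buffer, and only split tuples pay for a reassembled copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BLCKSZ 8192

/* Stand-in for a heap tuple as seen by a scan. */
typedef struct DemoTuple
{
    int   len;
    char *data;     /* points into a shared buffer for unsplit tuples */
} DemoTuple;

/* Return the tuple data.  Short tuples: a direct buffer pointer, no
 * copy.  Long (split) tuples: a fresh allocation the caller owns,
 * signaled through *caller_must_free. */
static char *
get_tuple_data(DemoTuple *tup, int *caller_must_free)
{
    char *copy;

    if (tup->len <= BLCKSZ)
    {
        *caller_must_free = 0;
        return tup->data;       /* cheap: no copy for the common case */
    }
    *caller_must_free = 1;
    /* a real implementation would walk the continuation chain here;
     * the plain copy stands in for the reassembled result */
    copy = malloc(tup->len);
    memcpy(copy, tup->data, tup->len);
    return copy;
}
```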
In fact, if we want a copy, we call heap_copytuple() to return a
palloc'ed copy. This interface has been cleaned up so it should be
clear what is happening. The old code was messy about this.
See my comments from heap_fetch(), which does require the user to supply
a buffer variable, so they can unlock it when they are done. The old
code allowed you to pass a NULL as a buffer pointer, so there was no
locking done, and that is bad!
---------------------------------------------------------------------------
/* ----------------
 * heap_fetch - retrieve tuple with tid
 *
 * Currently ignores LP_IVALID during processing!
 *
 * Because this is not part of a scan, there is no way to
 * automatically lock/unlock the shared buffers.
 * For this reason, we require that the user retrieve the buffer
 * value, and they are required to BufferRelease() it when they
 * are done. If they want to make a copy of it before releasing it,
 * they can call heap_copytuple().
 * ----------------
 */
void
heap_fetch(Relation relation,
           Snapshot snapshot,
           HeapTuple tuple,
           Buffer *userbuf)
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Aren't index tuples still tuples? Can't they be split just like
regular tuples?
Don't know, maybe.
While looking for some places where tuple data might be
accessed directly inside of the buffers I've searched for
WriteBuffer() and friends. These are mostly used in the index
access methods and some other places where I expected them,
so the index AMs at least have to be visited carefully when
implementing the tuple split.
See my recent mail: heap_getnext() and heap_fetch(). You can't get
lower access than that.
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
I knew there had to be a reason that some tests were BLCKSZ/2 and some
BLCKSZ.
Added to TODO:
* Allow index on tuple greater than 1/2 block size
Seems we have to allow columns over 1/2 block size for now. Most people
wouldn't index on them.
Don't know, maybe.
Actually we have some problems with indices on text
attributes when the content exceeds HALF of the blocksize:
FATAL 1: btree: failed to add item to the page
It crashes the backend AND seems to corrupt the index! Looks
to me that at least the btree code needs to be able to store
at minimum two items into one block and painfully fails if it
can't.
And just another one:
pgsql=> create table t1 (a int4, b char(4000));
CREATE
pgsql=> create index t1_b on t1 (b);
CREATE
pgsql=> insert into t1 values (1, 'a');
TRAP: Failed Assertion("!(( itid)->lp_flags & 0x01):",
File: "nbtinsert.c", Line: 361)
Bruce: One more TODO item!
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#========================================= wieck@debis.com (Jan Wieck) #
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
If we get wide tuples, we could just throw all large objects into one
table, and have an index on it. We can then vacuum it to compact space, etc.
Going toward >8k tuples would be really good, but I suspect we may have
some difficulties with the LO stuff once we implement it. Also it seems
that it's not worth adapting LOs to the newly designed tuples. I think
the design of the current LOs is so broken that we need to redesign them.
o it's slow: accessing a LO needs an open() that is not cheap. creating
many LOs makes the data/base/DBNAME/ directory fat.
o it consumes lots of i-nodes
o it breaks the tuple abstraction: this makes the code difficult to
maintain.
I would propose the following for the new version of LO:
o create a new data type that represents the LO
o when defining the LO data type in a table, it actually points to a
LO "body" in another place where it is physically stored.
o the storage for LO bodies would be a hidden table that contains
several LOs, not a single one.
o we can have several tables for the LO bodies. Probably a LO body
table for each corresponding table (where LO data type is defined) is
appropriate.
o it would be nice to place a LO table on a separate
directory/partition from the original table where the LO data type is
defined, since a LO body table could become huge.
Comments? Opinions?
---
Tatsuo Ishii
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
If we get wide tuples, we could just throw all large objects into one
table, and have an index on it. We can then vacuum it to compact space, etc.
I thought about that too. But if a table contains lots of LOs,
scanning it will take a long time. On the other hand, if LOs are
stored outside the table, scanning time will be shorter as long as we
don't need to read the content of each LO type field.
--
Tatsuo Ishii
Bruce Momjian wrote:
If we get wide tuples, we could just throw all large objects into one
table, and have an index on it. We can then vacuum it to compact space, etc.
Storing a 2Gb LO in a table is not a good thing.
Vadim
If we get wide tuples, we could just throw all large objects into one
table, and have an index on it. We can then vacuum it to compact space, etc.
I thought about that too. But if a table contains lots of LOs,
scanning it will take a long time. On the other hand, if LOs are
stored outside the table, scanning time will be shorter as long as we
don't need to read the content of each LO type field.
Use an index to get to the LO's in the table.
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026