Arbitrary tuple size

Started by Noname over 26 years ago, 33 messages
#1 Noname
wieck@debis.com

Well,

support for arbitrary tuple sizes should be as generic as possible.
Thus I think the best place to do it is down in the heapam
routines (heap_fetch(), heap_getnext(), heap_insert(), ...).
I'm not 100% sure, but nothing should access a heap relation
by going around them. Anyway, if there are such places, then it's
time to clean them up.

What about adding one more ItemPointerData to the tuple
header which holds the ctid of a DATA continuation tuple? If
a tuple doesn't fit into one block, this will tell where to
get the next chunk of tuple data building a chain until an
invalid ctid is found. The continuation tuples can have a
negative t_natts to be easily identified and ignored by
scanning routines.
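
For illustration, the header extension could look roughly like this
(a hedged sketch with invented names; only ItemPointerData and
t_natts are real):

#include "storage/itemptr.h"	/* ItemPointerData, ItemPointerIsValid */

/* Hypothetical sketch, not actual PostgreSQL code: one extra
 * ItemPointerData in the tuple header points at the next chunk. */
typedef struct HeapChunkInfo
{
	ItemPointerData	t_chain;	/* ctid of the DATA continuation
					 * tuple; invalid ends the chain */
} HeapChunkInfo;

/* Continuation chunks would carry a negative t_natts, so scans can
 * recognize and skip them: */
#define ChunkIsContinuation(natts)	((natts) < 0)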

By doing it this way we could also squeeze out some currently
wasted space. All tuples that get inserted/updated are added
to the end of the relation. If a tuple currently doesn't fit
into the freespace of the actual last block, that freespace
is wasted and the tuple is placed into a newly allocated block
at the end. So if there is 5K freespace and another 5.5K
tuple is added, the relation grows effectively by 10.5K!
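
The placement rule in question, as a simplified sketch (PageGetFreeSpace(),
BufferGetPage() and ReadBuffer(rel, P_NEW) are the real era calls; the
helper itself is schematic, not the actual backend code):

static Buffer
place_tuple(Relation relation, Buffer lastbuf, Size tuplen)
{
	Page	lastpage = BufferGetPage(lastbuf);

	/* The tail freespace of the last block is abandoned whenever the
	 * whole tuple doesn't fit: 5K free plus a 5.5K tuple costs a
	 * whole new 8K block, and the 5K stays wasted. */
	if (tuplen > PageGetFreeSpace(lastpage))
		return ReadBuffer(relation, P_NEW);	/* append new block */

	return lastbuf;		/* tuple fits into the existing block */
}

With chunked tuples, the first 5K chunk could go into the existing
freespace instead.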

I'm not sure how to handle this with vacuum, but I believe
Vadim is able to put in some well-placed goto's that make it work.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

#2 Vadim Mikheev
vadim@krs.ru
In reply to: Noname (#1)
Re: [HACKERS] Arbitrary tuple size

Jan Wieck wrote:

What about adding one more ItemPointerData to the tuple
header which holds the ctid of a DATA continuation tuple? If

Oh no. Fortunately we don't need that: we can just add a new flag
to t_infomask and put the continuation tid at the end of the tuple chunk.
Ok?
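
Something along these lines, as a sketch (the flag value and the helper
are invented here; only t_infomask itself is real):

#include <string.h>
#include "storage/itemptr.h"

#define HEAP_CONTINUED	0x8000	/* assumed-free t_infomask bit */

/* The continuation tid lives in the last bytes of the chunk data, so
 * the header layout stays untouched. Hypothetical helper: */
static ItemPointerData
chunk_next_tid(char *chunk, Size chunklen)
{
	ItemPointerData	next;

	memcpy(&next, chunk + chunklen - sizeof(ItemPointerData),
		   sizeof(ItemPointerData));
	return next;
}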

a tuple doesn't fit into one block, this will tell where to
get the next chunk of tuple data building a chain until an
invalid ctid is found. The continuation tuples can have a
negative t_natts to be easily identified and ignored by
scanning routines.

...

I'm not sure how to handle this with vacuum, but I believe
Vadim is able to put in some well-placed goto's that make it work.

-:)))
Ok, ok - I have a great number of goto-s in my pocket -:)

Vadim

#3 Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Noname (#1)
Re: [HACKERS] Arbitrary tuple size

Well,

support for arbitrary tuple sizes should be as generic as possible.
Thus I think the best place to do it is down in the heapam
routines (heap_fetch(), heap_getnext(), heap_insert(), ...).
I'm not 100% sure, but nothing should access a heap relation
by going around them. Anyway, if there are such places, then it's
time to clean them up.

What about adding one more ItemPointerData to the tuple
header which holds the ctid of a DATA continuation tuple? If
a tuple doesn't fit into one block, this will tell where to
get the next chunk of tuple data building a chain until an
invalid ctid is found. The continuation tuples can have a
negative t_natts to be easily identified and ignored by
scanning routines.

By doing it this way we could also squeeze out some currently
wasted space. All tuples that get inserted/updated are added
to the end of the relation. If a tuple currently doesn't fit
into the freespace of the actual last block, that freespace
is wasted and the tuple is placed into a newly allocated block
at the end. So if there is 5K freespace and another 5.5K
tuple is added, the relation grows effectively by 10.5K!

I'm not sure how to handle this with vacuum, but I believe
Vadim is able to put in some well-placed goto's that make it work.

I agree this is the way to go. There is nothing I can think of that
limits how large a tuple can be. It is just accessing it from the
heap routines that is the problem. If the tuple is alloc'ed to be used,
we can paste together the parts on disk and return one tuple. If they
are accessing the buffer copy directly, we would have to be smart about
going off the end of the disk copy and moving to the next segment.
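
A sketch of that "paste together" path (fetch_chunk() and chunk_next_tid()
are hypothetical helpers; palloc/repalloc are the real allocators):

/* Follow the continuation chain and return one contiguous palloc'ed
 * tuple image. Purely illustrative, not actual PostgreSQL code. */
static char *
reassemble_tuple(Relation rel, ItemPointerData tid, Size *len)
{
	char	   *buf = NULL;

	*len = 0;
	while (ItemPointerIsValid(&tid))
	{
		Size	chunklen;
		char   *chunk = fetch_chunk(rel, &tid, &chunklen);

		buf = buf ? repalloc(buf, *len + chunklen) : palloc(chunklen);
		memcpy(buf + *len, chunk, chunklen);
		*len += chunklen;
		tid = chunk_next_tid(chunk, chunklen);
	}
	return buf;			/* caller sees one flat tuple */
}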

The code is very clear now about accessing tuples or tuple copies.
Hopefully locking will not be an issue because you only need to lock the
main tuple. No one is going to see the secondary part of the tuple.

If Vadim can do MVCC, he certainly can handle this, with the help of
goto. :-)

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#4 Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Vadim Mikheev (#2)
Re: [HACKERS] Arbitrary tuple size

Jan Wieck wrote:

What about adding one more ItemPointerData to the tuple
header which holds the ctid of a DATA continuation tuple? If

Oh no. Fortunately we don't need that: we can just add a new flag
to t_infomask and put the continuation tid at the end of the tuple chunk.
Ok?

Sounds good. I would rather not add stuff to the tuple header if we can
avoid it.

I'm not sure how to handle this with vacuum, but I believe
Vadim is able to put in some well-placed goto's that make it work.

-:)))
Ok, ok - I have a great number of goto-s in my pocket -:)

I can send you more.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#5 Noname
wieck@debis.com
In reply to: Bruce Momjian (#4)
Re: [HACKERS] Arbitrary tuple size

Bruce Momjian wrote:

-:)))
Ok, ok - I have a great number of goto-s in my pocket -:)

I can send you more.

I have some cheap, spare longjmp()'s over here - anyone need
them? :-)

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

#6 Noname
wieck@debis.com
In reply to: Bruce Momjian (#3)
Re: [HACKERS] Arbitrary tuple size

Bruce Momjian wrote:

What about adding one more ItemPointerData to the tuple
header which holds the ctid of a DATA continuation tuple? If
a tuple doesn't fit into one block, this will tell where to
get the next chunk of tuple data building a chain until an
invalid ctid is found. The continuation tuples can have a
negative t_natts to be easily identified and ignored by
scanning routines.

Yes, Vadim, putting a flag into the bits already there is
much better. The information that a particular tuple is an
extension tuple should also go there instead of misusing
t_natts.

I agree this is the way to go. There is nothing I can think of that
limits how large a tuple can be. It is just accessing it from the
heap routines that is the problem. If the tuple is alloc'ed to be used,
we can paste together the parts on disk and return one tuple. If they
are accessing the buffer copy directly, we would have to be smart about
going off the end of the disk copy and moving to the next segment.

Who's accessing tuple attributes directly inside the buffer
copy (not only the header, which will still be unsplit at
the top of the chain)?

Aren't the situations where that is done restricted to system
catalogs? I think we can live with the restriction that the
tuple split will not be available for system relations,
because the only place where the limit hits us is pg_rewrite,
and that can be handled by redesigning the storage of rules,
which is already required by the rule recompilation TODO.

I can't think of anywhere in the code where a buffer from a user
relation (except for sequences, and that's another story) is
accessed that clumsily.

The code is very clear now about accessing tuples or tuple copies.
Hopefully locking will not be an issue because you only need to lock the
main tuple. No one is going to see the secondary part of the tuple.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

#7 Noname
wieck@debis.com
In reply to: Bruce Momjian (#3)
Re: [HACKERS] Arbitrary tuple size

Bruce Momjian wrote:

I agree this is the way to go. There is nothing I can think of that
limits how large a tuple can be.

Ouch - I can.

Having an index on a varlen field that now doesn't fit into
an index block any more. Wouldn't this cause problems? Well,
it's bad database design to index fields that will receive
such long data, because indexing them will blow up the
database, but it must work anyway.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#7)
Re: [HACKERS] Arbitrary tuple size

wieck@debis.com (Jan Wieck) writes:

Bruce Momjian wrote:

I agree this is the way to go. There is nothing I can think of that
limits how large a tuple can be.

Ouch - I can.

Having an index on a varlen field that now doesn't fit into
an index block any more. Wouldn't this cause problems?

Aren't index tuples still tuples? Can't they be split just like
regular tuples?

regards, tom lane

#9 Noname
wieck@debis.com
In reply to: Tom Lane (#8)
Re: [HACKERS] Arbitrary tuple size

Tom Lane wrote:

wieck@debis.com (Jan Wieck) writes:

Bruce Momjian wrote:

I agree this is the way to go. There is nothing I can think of that
limits how large a tuple can be.

Ouch - I can.

Having an index on a varlen field that now doesn't fit into
an index block any more. Wouldn't this cause problems?

Aren't index tuples still tuples? Can't they be split just like
regular tuples?

Don't know, maybe.

While looking for some places where tuple data might be
accessed directly inside the buffers, I've searched for
WriteBuffer() and friends. These are mostly used in the index
access methods and some other places where I expected them,
so the index AMs will at least have to be carefully visited when
implementing the tuple split.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

#10 Noname
wieck@debis.com
In reply to: Noname (#9)
Re: [HACKERS] Arbitrary tuple size

I wrote:

Tom Lane wrote:

Aren't index tuples still tuples? Can't they be split just like
regular tuples?

Don't know, maybe.

Actually we have some problems with indices on text
attributes when the content exceeds HALF of the blocksize:

FATAL 1: btree: failed to add item to the page

It crashes the backend AND seems to corrupt the index! Looks
to me like at least the btree code needs to be able to store
a minimum of two items in one block, and it fails painfully if it
can't.

And just another one:

pgsql=> create table t1 (a int4, b char(4000));
CREATE
pgsql=> create index t1_b on t1 (b);
CREATE
pgsql=> insert into t1 values (1, 'a');

TRAP: Failed Assertion("!(( itid)->lp_flags & 0x01):",
File: "nbtinsert.c", Line: 361)

Bruce: One more TODO item!

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

#11 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Noname (#10)
Re: [HACKERS] Arbitrary tuple size

Going toward >8k tuples would be really good, but I suspect we may have
some difficulties with the LO stuff once we implement it. Also, it seems
that it's not worth adapting LOs to the newly designed tuples. I think
the design of the current LOs is so broken that we need to redesign them.

o it's slow: accessing a LO needs an open(), which is not cheap; creating
many LOs makes the data/base/DBNAME/ directory fat.

o it consumes lots of i-nodes

o it breaks the tuple abstraction: this makes the code difficult to
maintain.

I would propose the following for the new version of LO:

o create a new data type that represents the LO

o when defining the LO data type in a table, it actually points to a
LO "body" in another place where it is physically stored.

o the storage for LO bodies would be a hidden table that contains
several LOs, not a single one (see the sketch after this list).

o we can have several tables for the LO bodies. Probably a LO body
table for each corresponding table (where the LO data type is defined) is
appropriate.

o it would be nice to place a LO table on a separate
directory/partition from the original table where LO data type is
defined, since a LO body table could become huge.
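
To make the hidden-table idea concrete, the body table could store one
row per chunk, roughly like this (a sketch with invented names, not a
committed design):

/* Hypothetical row layout of a hidden LO-body table: several LOs
 * share one table, addressed by (loid, chunkseq). */
typedef struct LOChunk
{
	Oid		loid;		/* which large object */
	int32	chunkseq;	/* position of this chunk within the LO */
	/* chunk data follows as a varlena, sized to fit one block */
} LOChunk;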

Comments? Opinions?
---
Tatsuo Ishii

#12 Vadim Mikheev
vadim@krs.ru
In reply to: Noname (#7)
Re: [HACKERS] Arbitrary tuple size

Jan Wieck wrote:

Bruce Momjian wrote:

I agree this is the way to go. There is nothing I can think of that
limits how large a tuple can be.

Ouch - I can.

Having an index on a varlen field that now doesn't fit into
an index block any more. Wouldn't this cause problems? Well,
it's bad database design to index fields that will receive
such long data, because indexing them will blow up the
database, but it must work anyway.

Seems that in other DBMSes the length of an index tuple is more restricted
than that of a heap tuple. So I think we shouldn't worry about this case.

Vadim

#13 Philip Warner
pjw@rhyme.com.au
In reply to: Tatsuo Ishii (#11)
Re: [HACKERS] Arbitrary tuple size

At 10:12 9/07/99 +0900, Tatsuo Ishii wrote:

o create a new data type that represents the LO

o when defining the LO data type in a table, it actually points to a
LO "body" in another place where it is physically stored.

Much as the purist in me hates the concept of hard links (as in Leon's suggestions), this *may* be a good application for them. Certainly that's how Dec(Oracle)/Rdb does it. Since most blobs will be totally rewritten when they are updated, this represents a slightly smaller problem in terms of MVCC.

o we can have several tables for the LO bodies. Probably a LO body
table for each corresponding table (where the LO data type is defined) is
appropriate.

Did you mean a table for each field? Or a table for each table (which may have more than 1 LO field)? See comments below.

o it would be nice to place a LO table on a separate
directory/partition from the original table where LO data type is
defined, since a LO body table could become huge.

I would very much like to see the ability to have multi-file databases and tables - i.e. the ability to store a table or index in a separate file. Perhaps with a user-defined partitioning function for table rows. The idea being that:

1. User specifies that a table can be stored in one of (say) three files.
2. When a record is first stored, the partitioning function is called to determine the file 'storage area' to use. [or a random selection method is used]

If you are going to allow LOs to be stored in multiple files, it seems a pity not to add some or all of this feature.

Additionally, the issue of pg_dump support for LOs needs to be addressed.

That's about it for me,

Philip Warner.

----------------------------------------------------------------
Philip Warner
Albatross Consulting Pty. Ltd. (A.C.N. 008 659 498)
Tel: +61-03-5367 7422
Fax: +61-03-5367 7430
Http://www.rhyme.com.au
PGP key available upon request, and from pgp5.ai.mit.edu:11371

#14 Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Noname (#6)
Re: [HACKERS] Arbitrary tuple size

I agree this is the way to go. There is nothing I can think of that
limits how large a tuple can be. It is just accessing it from the
heap routines that is the problem. If the tuple is alloc'ed to be used,
we can paste together the parts on disk and return one tuple. If they
are accessing the buffer copy directly, we would have to be smart about
going off the end of the disk copy and moving to the next segment.

Who's accessing tuple attributes directly inside the buffer
copy (not only the header, which will still be unsplit at
the top of the chain)?

Every call to heap_getnext(), for one. It locks the buffer, and returns
a pointer to the tuple. The next heap_getnext(), or heap_endscan()
releases the lock. The cost of returning every tuple as palloc'ed
memory would be huge. We may be able to get away with just returning
palloc'ed stuff for long tuples. That may be a simple, clean solution
that would be isolated.

In fact, if we want a copy, we call heap_copytuple() to return a
palloc'ed copy. This interface has been cleaned up so it should be
clear what is happening. The old code was messy about this.

See my comments from heap_fetch(), which does require the user to supply
a buffer variable, so they can unlock it when they are done. The old
code allowed you to pass a NULL as a buffer pointer, so there was no
locking done, and that is bad!

---------------------------------------------------------------------------

/* ----------------
* heap_fetch - retrieve tuple with tid
*
* Currently ignores LP_IVALID during processing!
*
* Because this is not part of a scan, there is no way to
* automatically lock/unlock the shared buffers.
* For this reason, we require that the user retrieve the buffer
* value, and they are required to BufferRelease() it when they
* are done. If they want to make a copy of it before releasing it,
* they can call heap_copytuple().
* ----------------
*/
void
heap_fetch(Relation relation,
Snapshot snapshot,
HeapTuple tuple,
Buffer *userbuf)
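
A hedged usage sketch of that interface (the tid variable, SnapshotNow,
heap_copytuple() and ReleaseBuffer() are assumed era names here; error
handling omitted - the copy-then-release pattern is the point):

	HeapTupleData	tuple;
	Buffer		buffer;
	HeapTuple	copy = NULL;

	tuple.t_self = *tid;		/* tid we are interested in */
	heap_fetch(relation, SnapshotNow, &tuple, &buffer);
	if (tuple.t_data != NULL)	/* found and visible */
	{
		copy = heap_copytuple(&tuple);	/* private palloc'ed copy */
		ReleaseBuffer(buffer);		/* done with shared buffer */
	}
	/* work with copy, then pfree(copy) */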

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#15 Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Noname (#9)
Re: [HACKERS] Arbitrary tuple size

Aren't index tuples still tuples? Can't they be split just like
regular tuples?

Don't know, maybe.

While looking for some places where tuple data might be
accessed directly inside the buffers, I've searched for
WriteBuffer() and friends. These are mostly used in the index
access methods and some other places where I expected them,
so the index AMs will at least have to be carefully visited when
implementing the tuple split.

See my recent mail. heap_getnext and heap_fetch(). Can't get lower
access than that.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#16 Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Noname (#10)
Re: [HACKERS] Arbitrary tuple size

I knew there had to be a reason that some tests were BLCKSZ/2 and some
BLCKSZ.

Added to TODO:

* Allow index on tuple greater than 1/2 block size

Seems we have to allow columns over 1/2 block size for now. Most people
wouldn't index on them.

Don't know, maybe.

Actually we have some problems with indices on text
attributes when the content exceeds HALF of the blocksize:

FATAL 1: btree: failed to add item to the page

It crashes the backend AND seems to corrupt the index! Looks
to me like at least the btree code needs to be able to store
a minimum of two items in one block, and it fails painfully if it
can't.

And just another one:

pgsql=> create table t1 (a int4, b char(4000));
CREATE
pgsql=> create index t1_b on t1 (b);
CREATE
pgsql=> insert into t1 values (1, 'a');

TRAP: Failed Assertion("!(( itid)->lp_flags & 0x01):",
File: "nbtinsert.c", Line: 361)

Bruce: One more TODO item!

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#17 Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tatsuo Ishii (#11)
Re: [HACKERS] Arbitrary tuple size

If we get wide tuples, we could just throw all large objects into one
table, and have an index on it. We can then vacuum it to compact space, etc.

Going toward >8k tuples would be really good, but I suspect we may have
some difficulties with the LO stuff once we implement it. Also, it seems
that it's not worth adapting LOs to the newly designed tuples. I think
the design of the current LOs is so broken that we need to redesign them.

o it's slow: accessing a LO needs an open(), which is not cheap; creating
many LOs makes the data/base/DBNAME/ directory fat.

o it consumes lots of i-nodes

o it breaks the tuple abstraction: this makes the code difficult to
maintain.

I would propose the following for the new version of LO:

o create a new data type that represents the LO

o when defining the LO data type in a table, it actually points to a
LO "body" in another place where it is physically stored.

o the storage for LO bodies would be a hidden table that contains
several LOs, not a single one.

o we can have several tables for the LO bodies. Probably a LO body
table for each corresponding table (where the LO data type is defined) is
appropriate.

o it would be nice to place a LO table on a separate
directory/partition from the original table where LO data type is
defined, since a LO body table could become huge.

Comments? Opinions?
---
Tatsuo Ishii

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#18 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Bruce Momjian (#17)
Re: [HACKERS] Arbitrary tuple size

If we get wide tuples, we could just throw all large objects into one
table, and have an index on it. We can then vacuum it to compact space, etc.

I thought about that too. But if a table contains lots of LOs,
scanning it will take a long time. On the other hand, if LOs are
stored outside the table, scanning time will be shorter as long as we
don't need to read the content of each LO type field.
--
Tatsuo Ishii

#19 Vadim Mikheev
vadim@krs.ru
In reply to: Bruce Momjian (#17)
Re: [HACKERS] Arbitrary tuple size

Bruce Momjian wrote:

If we get wide tuples, we could just throw all large objects into one
table, and have an index on it. We can then vacuum it to compact space, etc.

Storing a 2Gb LO in a table is not a good thing.

Vadim

#20 Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tatsuo Ishii (#18)
Re: [HACKERS] Arbitrary tuple size

If we get wide tuples, we could just throw all large objects into one
table, and have an index on it. We can then vacuum it to compact space, etc.

I thought about that too. But if a table contains lots of LOs,
scanning it will take a long time. On the other hand, if LOs are
stored outside the table, scanning time will be shorter as long as we
don't need to read the content of each LO type field.

Use an index to get to the LO's in the table.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#21 Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Vadim Mikheev (#19)
Re: [HACKERS] Arbitrary tuple size

Bruce Momjian wrote:

If we get wide tuples, we could just throw all large objects into one
table, and have an index on it. We can then vacuum it to compact space, etc.

Storing a 2Gb LO in a table is not a good thing.

Vadim

Ah, but we have segmented tables now. It will auto-split at 1 gig.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#22 Vadim Mikheev
vadim@krs.ru
In reply to: Bruce Momjian (#21)
Re: [HACKERS] Arbitrary tuple size

Bruce Momjian wrote:

Bruce Momjian wrote:

If we get wide tuples, we could just throw all large objects into one
table, and have an index on it. We can then vacuum it to compact space, etc.

Storing a 2Gb LO in a table is not a good thing.

Vadim

Ah, but we have segmented tables now. It will auto-split at 1 gig.

Well, now consider an update of a 2Gb row!
I worry not because of non-overwriting but about writing
a 2Gb log record to WAL - we won't be able to do it, for sure.

Isn't that why Informix restricts tuple length to only 32k?
And Oracle does the same.
Both of them have the ability to use more than one page for a single row,
but they have this restriction anyway.

I don't like _arbitrary_ tuple size.
I vote for some limit. 32K or 64K, at max.

Vadim

#23 Hannu Krosing
hannu@trust.ee
In reply to: Bruce Momjian (#21)
Re: [HACKERS] Arbitrary tuple size

Vadim Mikheev wrote:

Bruce Momjian wrote:

Bruce Momjian wrote:

If we get wide tuples, we could just throw all large objects into one
table, and have an index on it. We can then vacuum it to compact space, etc.

Storing a 2Gb LO in a table is not a good thing.

Vadim

Ah, but we have segmented tables now. It will auto-split at 1 gig.

Well, now consider an update of a 2Gb row!
I worry not because of non-overwriting but about writing
a 2Gb log record to WAL - we won't be able to do it, for sure.

Can't we write just some kind of diff (only changed pages) in WAL,
either starting at some threshold or just based on the seek/write logic
of LOs?

It will add complexity, but having some arbitrary limits seems very
wrong.

It will also make indexing LOs more complex, but as we don't currently
index them anyway, it's not a big problem yet.

Setting the limit higher (like 16M, where all my current LOs would fit :))
is just postponing the problems. Does "who will need more than 640k of
RAM" sound familiar?

Isn't that why Informix restricts tuple length to only 32k?
And Oracle does the same.

Does anyone know what the limit for Oracle8i is? As they advertise it as a
replacement file system among other things, I guess it can't be too low -
I suspect 2G at minimum.

Both of them have the ability to use more than one page for a single row,
but they have this restriction anyway.

I don't like _arbitrary_ tuple size.

Why not ?

IMHO we should allow _arbitrary_ (in reality probably <= MAXINT), but
optimize for some known size and tell the users that if they exceed it
the performance would suffer.

So when I have 99% of my LOs in the 10k-80k range but have a few 512k-2M
ones, I can just live with the bigger ones having bad performance instead
of implementing an additional LO manager in the frontend too.

I vote for some limit.

Why limit ?

32K or 64K, at max.

Why so low? Please make it at least configurable, preferably at runtime.

And if you go that way, please assume this limit (in code) for tuple size
only, and not in the FE/BE protocol - it will make it easier for someone
to fix the backend to work with larger ones later.

The LOs should remain load-on-demand anyway; it should just be made more
transparent for end-users.


Vadim

#24 Noname
wieck@debis.com
In reply to: Tatsuo Ishii (#11)
Re: [HACKERS] Arbitrary tuple size

Tatsuo Ishii wrote:

Going toward >8k tuples would be really good, but I suspect we may have
some difficulties with the LO stuff once we implement it. Also, it seems
that it's not worth adapting LOs to the newly designed tuples. I think
the design of the current LOs is so broken that we need to redesign them.

[... LO stuff deleted ...]

I wasn't talking about a new datatype that can exceed the
tuple limit. The general tuple split I want will also handle
the case where a row with 40 text attributes of 1K each gets
stored. That's something different.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

#25 Noname
wieck@debis.com
In reply to: Vadim Mikheev (#22)
Re: [HACKERS] Arbitrary tuple size

Vadim wrote:

Bruce Momjian wrote:

Bruce Momjian wrote:

If we get wide tuples, we could just throw all large objects into one
table, and have an index on it. We can then vacuum it to compact space, etc.

Storing a 2Gb LO in a table is not a good thing.

Vadim

Ah, but we have segmented tables now. It will auto-split at 1 gig.

Well, now consider an update of a 2Gb row!
I worry not because of non-overwriting but about writing
a 2Gb log record to WAL - we won't be able to do it, for sure.

Isn't that why Informix restricts tuple length to only 32k?
And Oracle does the same.
Both of them have the ability to use more than one page for a single row,
but they have this restriction anyway.

I don't like _arbitrary_ tuple size.
I vote for some limit. 32K or 64K, at max.

To have some limit seems reasonable to me (I've also read
the other comments). When dealing with regular tuples, first
off the query to insert or update them will get read into
memory. Next the querytree with the Const vars is built,
rewritten, and planned. Then the tuple is built in memory and
maybe copied somewhere else (fulltext index trigger). So the
amount of memory will be allocated many times!

There is some natural limit on the tuple size depending on
the available swapspace. Not everyone has a multiple-GB
swapspace setup. Making it a well-known hard limit that
doesn't hurt even if 20 backends do things simultaneously is
better.

I vote for a limit too. 64K should be enough.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

#26 Zeugswetter Andreas
andreas.zeugswetter@telecom.at
In reply to: Noname (#25)
Re: [HACKERS] Arbitrary tuple size

I knew there had to be a reason that some tests were BLCKSZ/2 and some
BLCKSZ.

Added to TODO:

* Allow index on tuple greater than 1/2 block size

Seems we have to allow columns over 1/2 block size for now. Most people
wouldn't index on them.

Since an index header page has to point to at least 2 other leaf or
header pages, it stores at least 2 keys per page.
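
That requirement puts a rough ceiling on the key size (a
back-of-the-envelope sketch; exact header sizes differ per version):

/* Two items must fit on one page, so approximately: */
#define MAX_BTREE_KEY_SIZE \
	((BLCKSZ - sizeof(PageHeaderData)) / 2 \
	 - sizeof(BTItemData) - sizeof(ItemIdData))
/* With the default BLCKSZ of 8192 this lands near the ~4K boundary
 * where the reported failures start. */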

I would alter the todo to say:

* fix btree to give a useful elog when key > 1/2 (page - overhead) and not
abort

to fix the:

FATAL 1: btree: failed to add item to the page

A key of more than 4k will want a more efficient index type than btree
for such data anyway.

Andreas

#27 Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Vadim Mikheev (#22)
Re: [HACKERS] Arbitrary tuple size

Ah, but we have segmented tables now. It will auto-split at 1 gig.

Well, now consider an update of a 2Gb row!
I worry not because of non-overwriting but about writing
a 2Gb log record to WAL - we won't be able to do it, for sure.

Isn't that why Informix restricts tuple length to only 32k?
And Oracle does the same.
Both of them have the ability to use more than one page for a single row,
but they have this restriction anyway.

I don't like _arbitrary_ tuple size.
I vote for some limit. 32K or 64K, at max.

Yes, but having it all in one table prevents an fopen() call for every
access and inode use for every large object, and allows vacuum to clean up
multiple versions. Just an idea. I see your point.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#28 Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Hannu Krosing (#23)
Re: [HACKERS] Arbitrary tuple size

Well, now consider an update of a 2Gb row!
I worry not because of non-overwriting but about writing
a 2Gb log record to WAL - we won't be able to do it, for sure.

Can't we write just some kind of diff (only changed pages) in WAL,
either starting at some threshold or just based on the seek/write logic
of LOs?

It will add complexity, but having some arbitrary limits seems very
wrong.

It will also make indexing LOs more complex, but as we don't currently
index them anyway, it's not a big problem yet.

Well, we do indexing of large objects by using the OS directory code to
find a given directory entry.

Why not ?

IMHO we should allow _arbitrary_ (in reality probably <= MAXINT), but
optimize for some known size and tell the users that if they exceed it
the performance would suffer.

If they go over a certain size, they can decide to store it in the file
system, as many users are doing now.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#29 Brook Milligan
brook@trillium.NMSU.Edu
In reply to: Noname (#25)
Re: [HACKERS] Arbitrary tuple size

I don't like _arbitrary_ tuple size.
I vote for some limit. 32K or 64K, at max.

To have some limit seems reasonable to me (I've also read
the other comments). When dealing with regular tuples, first

Isn't anything other than arbitrary sizes just making us encounter the
same problem later? Clearly, there are real hardware limits, but we
shouldn't build that into the code. It seems to me the solution is to
have arbitrary (e.g., hardware-driven) limits, document what is
necessary to support certain operations, and let the fanatics buy
mega-systems if they need to support huge tuples. As long as the code
is optimized for more reasonable situations, there should be no
penalty.

Cheers,
Brook

#30 Zeugswetter Andreas
andreas.zeugswetter@telecom.at
In reply to: Brook Milligan (#29)
Re: [HACKERS] Arbitrary tuple size

Storing a 2Gb LO in a table is not a good thing.

at least vacuum and sequential scan will need to read it,
so I agree storing a large LO in the row is a no-no.

Well, now consider an update of a 2Gb row!
I worry not because of non-overwriting but about writing
a 2Gb log record to WAL - we won't be able to do it, for sure.

This is imho no different than with an external LO, since
for a rollforward we need the new value one way or another.
I don't see a special problem other than performance.

Informix has many ways to configure LO storage, 2 of which are:
1. store LO in the Tablespace (then all changes are written
to the Transaction log directly, and all LO IO is buffered)
LO's are always stored on separate pages in this tablespace,
and not with the row.
2. store LO in a separate blobspace
What Informix then does is to not write LO changes to the log,
only a reference, and the process that backs up the logs
then also reads the new LO's and writes them to tape.
In this setup all LO IO bypasses the bufferpool and is
synchronous.

Can't we write just some kind of diff (only changed pages) in WAL,
either starting at some threshold or just based on the seek/write logic
of LOs?

It will add complexity, but having some arbitrary limits seems very
wrong.

The same holds true for the whole row. Only the changed columns
would need to go to the log. Consider a refcount and a large text column:
we would not want to log a 4K text column if only the 4-byte refcount
changed.
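
A sketch of what such a column-diff record might carry (invented names;
nothing like this exists yet):

/* Hypothetical WAL record for an update logging only changed columns. */
typedef struct XLogColumnDiff
{
	ItemPointerData	target;		/* tuple that was updated */
	uint32		changed;	/* bitmap, one bit per changed column */
	/* new values of the changed columns follow back to back; the
	 * refcount case above would log a few bytes instead of 4K */
} XLogColumnDiff;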

Andreas

#31 Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Zeugswetter Andreas (#26)
Re: [HACKERS] Arbitrary tuple size

Done. Thanks.

I knew there had to be a reason that some tests were BLCKSZ/2 and some
BLCKSZ.

Added to TODO:

* Allow index on tuple greater than 1/2 block size

Seems we have to allow columns over 1/2 block size for now. Most people
wouldn't index on them.

Since an index header page has to point to at least 2 other leaf or
header pages, it stores at least 2 keys per page.

I would alter the todo to say:

* fix btree to give a useful elog when key > 1/2 (page - overhead) and not
abort

to fix the:

FATAL 1: btree: failed to add item to the page

A key of more than 4k will want a more efficient index type than btree
for such data anyway.

Andreas

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#32 The Hermit Hacker
scrappy@hub.org
In reply to: Vadim Mikheev (#22)
Re: [HACKERS] Arbitrary tuple size

On Fri, 9 Jul 1999, Vadim Mikheev wrote:

Bruce Momjian wrote:

Bruce Momjian wrote:

If we get wide tuples, we could just throw all large objects into one
table, and have an index on it. We can then vacuum it to compact space, etc.

Storing a 2Gb LO in a table is not a good thing.

Vadim

Ah, but we have segmented tables now. It will auto-split at 1 gig.

Well, now consider an update of a 2Gb row!
I worry not because of non-overwriting but about writing
a 2Gb log record to WAL - we won't be able to do it, for sure.

What I'm kinda curious about is *why* you would want to store a LO in the
table in the first place? And, consequently, as Bruce had
suggested...index it? Unless something has changed recently that I
totally missed, the only time the index would be used is if a query was
based on a) start of string (ie. ^<string>) or b) complete string (ie.
^<string>$) ...

So what benefit would an index be on a LO?

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#33 Philip Warner
pjw@rhyme.com.au
In reply to: The Hermit Hacker (#32)
Re: [HACKERS] Arbitrary tuple size

At 09:04 28/07/99 -0300, The Hermit Hacker wrote:

On Fri, 9 Jul 1999, Vadim Mikheev wrote:

Bruce Momjian wrote:

Bruce Momjian wrote:

If we get wide tuples, we could just throw all large objects into one
table, and have an index on it. We can then vacuum it to compact space,

etc.

Storing a 2Gb LO in a table is not a good thing.

Vadim

Ah, but we have segmented tables now. It will auto-split at 1 gig.

Well, now consider an update of a 2Gb row!
I worry not because of non-overwriting but about writing
a 2Gb log record to WAL - we won't be able to do it, for sure.

What I'm kinda curious about is *why* you would want to store a LO in the
table in the first place? And, consequently, as Bruce had
suggested...index it? Unless something has changed recently that I
totally missed, the only time the index would be used is if a query was
based on a) start of string (ie. ^<string>) or b) complete string (ie.
^<string>$) ...

So what benefit would an index be on a LO?

Some systems (Dec RDB) won't even let you index the contents of an LO.
Anyone know what other systems do?

Also, to repeat a question from an earlier post: is there a plan for the
BLOB implementation that is available for comment/contribution?

----------------------------------------------------------------
Philip Warner
Albatross Consulting Pty. Ltd. (A.C.N. 008 659 498)
Tel: +61-03-5367 7422
Fax: +61-03-5367 7430
Http://www.rhyme.com.au
PGP key available upon request, and from pgp5.ai.mit.edu:11371