update on TOAST status

Started by Noname over 25 years ago · 35 messages
#1 Noname
JanWieck@t-online.de

FYI,

during the day I committed a couple of changes to TOAST.

- Secondary relations for the toaster (to move off values)
are now automatically created during CREATE TABLE, ALTER
TABLE ... ADD COLUMN and SELECT ... INTO, whenever the
first toastable attribute appears in the table schema.

- The TOAST tables are now of kind RELKIND_TOASTVALUE.

- TOAST tables cannot be vacuumed separately. They are
always vacuumed if their master table is, while VACUUM
still holds the lock on the master table.

- VACUUM doesn't propagate ANALYZE to TOAST tables.
Statistics for them are needless because the toast access
is always a hardcoded index access.

- TOAST tables are protected against manual INSERT, UPDATE
and DELETE operations. SELECT is still possible for
debugging purposes. The name of the TOAST table is
pg_toast_<oid-of-master>.

- The chunk_data attribute has been changed to type bytea.

For now, "lztext" is the only test candidate datatype to
invoke the toaster. It can now hold multi-megabyte values.
But be warned: this datatype will disappear as soon as "text"
is toastable.
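As an illustration of the move-off mechanism (a hypothetical Python sketch, not backend code; CHUNK_SIZE, the dict-based tables, and the function names are invented for this example), a toaster could split a large value into chunks keyed by (value oid, chunk sequence) in the pg_toast_<oid-of-master> relation and reassemble them on fetch:

```python
# Illustrative sketch only: how a toaster might move an oversized attribute
# into fixed-size chunk rows in a secondary relation. CHUNK_SIZE and all
# names here are invented; the real backend code and thresholds differ.
CHUNK_SIZE = 2000  # assumed bytes per chunk row

def toast_value(master_oid, value_oid, data, toast_tables):
    """Split `data` into chunks stored under (value_oid, seq)."""
    table = toast_tables.setdefault(f"pg_toast_{master_oid}", {})
    nchunks = (len(data) + CHUNK_SIZE - 1) // CHUNK_SIZE
    for seq in range(nchunks):
        table[(value_oid, seq)] = data[seq * CHUNK_SIZE:(seq + 1) * CHUNK_SIZE]
    return value_oid  # the main tuple keeps only this reference

def detoast_value(master_oid, value_oid, toast_tables):
    """Reassemble the chunks via an indexed lookup on (value_oid, seq)."""
    table = toast_tables[f"pg_toast_{master_oid}"]
    chunks = sorted((k, v) for k, v in table.items() if k[0] == value_oid)
    return b"".join(v for _, v in chunks)
```

The "always hardcoded indexed" access mentioned above corresponds to the (value_oid, seq) lookup: no statistics are needed to plan it.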

Next I'll make pg_dump TOAST-safe. Will only take a couple of
minutes I think.

Toast tables aren't automatically created for system
catalogs. Thus I'll add

ALTER TABLE pg_rewrite CREATE TOAST TABLE

to initdb. So we'll get unlimited view complexity for free.
As soon as arrays are toastable, we might want to add
pg_class because of relacl too.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#2 Chris Bitmead
chrisb@nimrod.itg.telstra.com.au
In reply to: Noname (#1)
Re: update on TOAST status

Jan Wieck wrote:

Toast tables aren't automatically created for system
catalogs. Thus I'll add

ALTER TABLE pg_rewrite CREATE TOAST TABLE

to initdb. So we'll get unlimited view complexity for free.
As soon as arrays are toastable, we might want to add
pg_class because of relacl too.

Why would we want system catalogs toastable?

#3 Mike Mascari
mascarm@mascari.com
In reply to: Noname (#1)
Re: update on TOAST status

Chris Bitmead wrote:

Jan Wieck wrote:

Toast tables aren't automatically created for system
catalogs. Thus I'll add

ALTER TABLE pg_rewrite CREATE TOAST TABLE

to initdb. So we'll get unlimited view complexity for free.
As soon as arrays are toastable, we might want to add
pg_class because of relacl too.

Why would we want system catalogs toastable?

I assume this will allow for Views with large rewrite rules which
currently are limited in size.

Mike Mascari

#4 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Noname (#1)
Re: update on TOAST status'

- VACUUM doesn't propagate ANALYZE to TOAST tables.
Statistics for them are needless because the toast access
is always a hardcoded index access.

I don't think statistics are insignificant for TOASTed columns. If I
say col=3, the optimizer uses that information for estimating the number
of rows returned, and figuring out the type of join and order of join to
perform, not just for "use index, don't use index" decisions.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#5 Noname
JanWieck@t-online.de
In reply to: Bruce Momjian (#4)
Re: update on TOAST status'

Bruce Momjian wrote:

- VACUUM doesn't propagate ANALYZE to TOAST tables.
Statistics for them are needless because the toast access
is always a hardcoded index access.

I don't think statistics are insignificant for TOASTed columns. If I
say col=3, the optimizer uses that information for estimating the number
of rows returned, and figuring out the type of join and order of join to
perform, not just for "use index, don't use index" decisions.

Ask your boys to give you a training session in "reading"
when they go to bed tonight - and say hello to them from the
"police officer" :-)

I said "to TOAST tables", not "TOASTed columns".

Their master tables will always have the statistics,
including those for toasted columns, if you ask for them via
ANALYZE.

In normal operation, no one would ever know if a TOAST table
is accessed during his query - not even the planner or
optimizer. It's totally transparent and the only one
accessing the TOAST tables is the toaster himself - and he
knows what he does.

Jan


#6 Noname
JanWieck@t-online.de
In reply to: Mike Mascari (#3)
Re: update on TOAST status

Mike Mascari wrote:

Chris Bitmead wrote:

Jan Wieck wrote:

Toast tables aren't automatically created for system
catalogs. Thus I'll add

ALTER TABLE pg_rewrite CREATE TOAST TABLE

to initdb. So we'll get unlimited view complexity for free.
As soon as arrays are toastable, we might want to add
pg_class because of relacl too.

Why would we want system catalogs toastable?

I assume this will allow for Views with large rewrite rules which
currently are limited in size.

Absolutely correct.

With the code in place (after a few more fixes) I was able to
create a "SELECT *" view of a 681-attribute table. The
resulting rule is about 170K! And more complex things are
possible now too, because the rewrite rule size is no longer
limited (as long as you have enough CPU, RAM and disk space).

Jan


#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#4)
Re: update on TOAST status'

Bruce Momjian <pgman@candle.pha.pa.us> writes:

- VACUUM doesn't propagate ANALYZE to TOAST tables.
Statistics for them are needless because the toast access
is always a hardcoded index access.

I don't think statistics are insignificant for TOASTed columns.

He didn't say that! I think what he meant is that there's no need for
statistics associated with the TOAST table itself, and AFAICS that's OK.

BTW, I have thought of a potential problem with indexes on toasted
columns. As I understand Jan's current thinking, the idea is

1. During storage of the tuple in the main table, any oversize fields
get compressed/moved off.

2. The toasted item in the finished main tuple gets handed to the index
routines to be stored in the index.

Now, storing the toasted item in the index tuple seems fine, but what
I do not like here is the implication that all the comparisons needed
to find where to *put* the index tuple are done using a pretoasted
value. That seems to imply dozens of redundant decompressions/fetches,
another one for each key comparison we have to do.

Jan, do you have a way around this that I missed?

One simple answer that might help for other scenarios too is to keep
a small cache of the last few values that had to be untoasted. Maybe
we only need it for moved-off values --- it could be that decompression
is fast enough that we should just do it over rather than trying to
cache.
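Such a cache of recently detoasted values could look roughly like this (a hypothetical Python sketch; the class name, the capacity, and keying by value oid are assumptions, not an actual design):

```python
from collections import OrderedDict

# Sketch of the idea above: remember the last few detoasted values so that
# repeated comparisons against the same toasted datum don't refetch and
# re-decompress it every time. Purely illustrative, not backend code.
class DetoastCache:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = OrderedDict()  # value oid -> plain (detoasted) bytes
        self.misses = 0

    def get(self, value_oid, fetch_and_decompress):
        if value_oid in self.entries:
            self.entries.move_to_end(value_oid)  # mark most recently used
            return self.entries[value_oid]
        self.misses += 1
        plain = fetch_and_decompress(value_oid)  # the expensive path
        self.entries[value_oid] = plain
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return plain
```

Whether the cache should hold moved-off values only, as suggested, is just a question of which fetch paths call `get`.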

regards, tom lane

#8 Noname
JanWieck@t-online.de
In reply to: Tom Lane (#7)
Re: update on TOAST status'

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

- VACUUM doesn't propagate ANALYZE to TOAST tables.
Statistics for them are needless because the toast access
is always a hardcoded index access.

I don't think statistics are insignificant for TOASTed columns.

He didn't say that! I think what he meant is that there's no need for
statistics associated with the TOAST table itself, and AFAICS that's OK.

BTW, I have thought of a potential problem with indexes on toasted
columns. As I understand Jan's current thinking, the idea is

1. During storage of the tuple in the main table, any oversize fields
get compressed/moved off.

2. The toasted item in the finished main tuple gets handed to the index
routines to be stored in the index.

Right.

Now, storing the toasted item in the index tuple seems fine, but what
I do not like here is the implication that all the comparisons needed
to find where to *put* the index tuple are done using a pretoasted
value. That seems to imply dozens of redundant decompressions/fetches,
another one for each key comparison we have to do.

Dozens - right.

I just did a little gdb session tracing a

SELECT ... WHERE toasted = 'xxx'

The table has 151 rows and an index on 'toasted'. It needed 6
fetches of the attribute. Better than good, because 2^6 is
only 64, so btree did a perfect job. Anyhow, in the case of a
really TOASTed (read: burned) value, it'd mean 6 index scans
to recreate the on-disk representation plus 6 decompression
loops to get the plain value to compare against. What the
hell would an "IN (SELECT ...)" cause?
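The effect of the trace above can be modeled in a few lines of Python (an illustrative sketch; the function name and the one-detoast-per-comparison assumption are mine, not from the backend): every key comparison during the descent may have to detoast the stored value, so the detoast count tracks the comparison count:

```python
# Rough model of a btree descent over toasted keys: counting detoasts in a
# binary search over 151 sorted keys lands in the same ballpark as the
# 6 fetches observed in the gdb trace. Hypothetical sketch only.
def search_with_detoast_count(keys, probe):
    """Binary search; returns (found, number_of_detoast_calls)."""
    detoasts = 0
    lo, hi = 0, len(keys) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        stored = keys[mid]  # in reality this fetch may trigger a detoast
        detoasts += 1
        if stored == probe:
            return True, detoasts
        if stored < probe:
            lo = mid + 1
        else:
            hi = mid - 1
    return False, detoasts
```

For 151 keys, roughly log2(151), about 7 to 8 comparisons, is the ceiling; each one is a potential index scan plus decompression when the key is moved off.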

Jan, do you have a way around this that I missed?

One simple answer that might help for other scenarios too is to keep
a small cache of the last few values that had to be untoasted. Maybe
we only need it for moved-off values --- it could be that decompression
is fast enough that we should just do it over rather than trying to
cache.

I'm still arguing that indexing huge values is a hint of a
misleading schema. If this is true, propagating toasted
attributes into indices is a dead-end street and I'd have to
change the heap-access<->toaster interface so that the
modified (stored) main tuple isn't visible to the following
code (that does the index inserts).

What is the value of supporting index tuples >2K? Support of
braindead schemas? I can live without it!

Jan


#9 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Noname (#5)
Re: update on TOAST status'

Bruce Momjian wrote:

- VACUUM doesn't propagate ANALYZE to TOAST tables.
Statistics for them are needless because the toast access
is always a hardcoded index access.

I don't think statistics are insignificant for TOASTed columns. If I
say col=3, the optimizer uses that information for estimating the number
of rows returned, and figuring out the type of join and order of join to
perform, not just for "use index, don't use index" decisions.

Ask your boys to give you a training session in "reading"
when they go to bed tonight - and say hello to them from the
"police officer" :-)

Sure.

I said "to TOAST tables", not "TOASTed columns".

Their master tables will always have the statistics,
including those for toasted columns, if you ask for them via
ANALYZE.

In normal operation, no one would ever know if a TOAST table
is accessed during his query - not even the planner or
optimizer. It's totally transparent and the only one
accessing the TOAST tables is the toaster himself - and he
knows what he does.

Oh, sure, got it. It is the toast table that doesn't need stats.

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#8)
Re: update on TOAST status'

JanWieck@t-online.de (Jan Wieck) writes:

Tom Lane wrote:

One simple answer that might help for other scenarios too is to keep
a small cache of the last few values that had to be untoasted. Maybe
we only need it for moved-off values --- it could be that decompression
is fast enough that we should just do it over rather than trying to
cache.

I'm still arguing that indexing huge values is a hint of a
misleading schema. If this is true, propagating toasted
attributes into indices is a dead-end street and I'd have to
change the heap-access<->toaster interface so that the
modified (stored) main tuple isn't visible to the following
code (that does the index inserts).

But you'll notice that is *not* what I suggested. A detoasted-value
cache could be useful in more situations than just an index lookup.
I don't necessarily say we've got to have it in 7.1, but let's keep
the idea in mind in case we start finding there is a bottleneck here.

What is the value of supporting index tuples >2K?

If you're toasting the whole main tuple down to <2K, you might find
yourself toasting individual fields that are a good bit less than
that. So I don't think indexing a toasted value will be all that
unusual.

But this is all speculation for now. Let's get it working bulletproof
for 7.1, and then worry about speedups after we know they are needed.

regards, tom lane

#11 Noname
JanWieck@t-online.de
In reply to: Tom Lane (#10)
Re: update on TOAST status'

Tom Lane wrote:

JanWieck@t-online.de (Jan Wieck) writes:

Tom Lane wrote:

One simple answer that might help for other scenarios too is to keep
a small cache of the last few values that had to be untoasted. Maybe
we only need it for moved-off values --- it could be that decompression
is fast enough that we should just do it over rather than trying to
cache.

I'm still arguing that indexing huge values is a hint of a
misleading schema. If this is true, propagating toasted
attributes into indices is a dead-end street and I'd have to
change the heap-access<->toaster interface so that the
modified (stored) main tuple isn't visible to the following
code (that does the index inserts).

But you'll notice that is *not* what I suggested. A detoasted-value

Haven't missed it on the first read - of course.

cache could be useful in more situations than just an index lookup.
I don't necessarily say we've got to have it in 7.1, but let's keep
the idea in mind in case we start finding there is a bottleneck here.

What is the value of supporting index tuples >2K?

If you're toasting the whole main tuple down to <2K, you might find
yourself toasting individual fields that are a good bit less than
that. So I don't think indexing a toasted value will be all that
unusual.

That's exactly why I'm asking whether we wouldn't be better
off limiting index tuples to (blocksize - overhead) / 4 and
always storing plain, untoasted values in indices.

I've asked a couple of times now: "who really has the need
for indexing huge values?" All responses I got so far were of
the kind "would be nice if we support it" or "I don't like
such restrictions". But no one really said "I need it".

But this is all speculation for now. Let's get it working bulletproof
for 7.1, and then worry about speedups after we know they are needed.

Let me speculate a little too.

The experience I have up to now is that the time saved by
requiring fewer blocks in the buffer cache outweighs the cost
of decompression. Especially with our algorithm, because it
is byte oriented (instead of Huffman coding being based on a
bit stream), making it extremely fast on decompression. And
the technique of moving values off the main heap makes the
main tuples much smaller. As long as the toasted values
aren't used in qualification or joining, only their
references move around through the various executor steps,
and only those values that are part of the final result set
need to be fetched when sending them to the client.

Given a limited amount of total memory available for one
running postmaster, we save a lot of disk I/O and hold more
values in their compressed format in the shared buffers. With
the limit on total memory, the size of the buffer cache must
be lowered by the size of the new detoast cache, and that
only if we make it shared too. Given further an average
compression ratio of 50% (which is not unlikely with typical
input like HTML pages), one cached detoasted value would
require two compressed ones to go away.

Wouldn't really surprise me if we gain speed from it in the
average query. Even if some operations might slow down
(sorting on possibly toasted fields).

We need to see some results and wait for reports on this.
But we know already that it can cause trouble with indexed
fields, because these are likely to be used for comparison
during scans. So do we want to have indices always store
plain values and limit the index-tuple size, or not?

Jan


#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#11)
Re: update on TOAST status'

JanWieck@t-online.de (Jan Wieck) writes:

... So do we want to have indices always store
plain values and limit the index-tuple size, or not?

I think not: it will be seen as a robustness failure, even (or
especially) if it doesn't happen often. I can see the bug reports now:
"Hey! I tried to insert a long value in my field, and it didn't work!
I thought you'd fixed this bug?"

You make good arguments that we shouldn't be too concerned about the
speed of access to toasted index values, and I'm willing to accept
that point of view (at least till we have hard evidence about it).
But when I say "it should be bulletproof" I mean it should *work*,
without imposing arbitrary limits on the user. Arbitrary limits are
exactly what we are trying to eliminate.

regards, tom lane

#13 Noname
JanWieck@t-online.de
In reply to: Tom Lane (#12)
Re: update on TOAST status'

Tom Lane wrote:

JanWieck@t-online.de (Jan Wieck) writes:

... So do we want to have indices always store
plain values and limit the index-tuple size, or not?

I think not: it will be seen as a robustness failure, even (or
especially) if it doesn't happen often. I can see the bug reports now:
"Hey! I tried to insert a long value in my field, and it didn't work!
I thought you'd fixed this bug?"

You make good arguments that we shouldn't be too concerned about the
speed of access to toasted index values, and I'm willing to accept
that point of view (at least till we have hard evidence about it).
But when I say "it should be bulletproof" I mean it should *work*,
without imposing arbitrary limits on the user. Arbitrary limits are
exactly what we are trying to eliminate.

After debugging something I thought was a bug in the toaster,
I've found something that really causes headaches.

TOAST AS IS IS NOT CAPABLE OF HOLDING INDEXED VALUES!

It appears that btree indices (at least) can keep references
to old toast values that survive a VACUUM! Seems these
references live in nodes that no longer refer to a heap
tuple, but are used during tree traversal in comparisons. As
if an index tuple delete from a btree does not necessarily
cause the index value to disappear from the btree completely.
It'll never be returned by an index scan, but the value is
still there somewhere.

Everything is OK with this up to a VACUUM run. The toaster
uses SnapshotAny to fetch toast values. So an external value
can be fetched by the toaster even if it is already deleted
and committed. If he has a reference somewhere, he always
has a share or higher lock on the main relation, preventing
VACUUM from mangling up the toast relation (I moved toast
relation vacuuming into the lock time of the main table
recently).

But in the above case it is already vacuumed and no longer
present. Now the btree traversal needs to compare against a
value long gone to bit heaven, and that cannot work with the
toast architecture.

Seems the designs of btree and toast are colliding. As soon
as "text" is toastable, this'll hurt - be warned.

AFAICS, we need to always detoast values for index inserts,
and some day have another toaster inside the index access
methods. In the meantime we should decide on a safe maximum
index tuple size and emit an explanatory error message on the
attempt to insert oversized index entries, instead of
possibly corrupting the index.

Comments?

Jan


#14 Philip Warner
pjw@rhyme.com.au
In reply to: Noname (#13)
Re: update on TOAST status'

At 14:02 11/07/00 +0200, Jan Wieck wrote:

AFAICS, we need to always detoast values for index inserts,
and some day have another toaster inside the index access
methods.

We might not need it... at least not in the first pass.

In the meantime we should decide on a safe maximum
index tuple size and emit an explanatory error message on the
attempt to insert oversized index entries, instead of
possibly corrupting the index.

Can I suggest that we also put out a warning when defining an index using a
field with a (potentially) unlimited size? Indexing a text field will
mostly be a bizarre thing to do, but, e.g., indexing the first 255 chars of a
text field (via substr) might not be.

----------------------------------------------------------------
Philip Warner
Albatross Consulting Pty. Ltd. (A.C.N. 008 659 498)
Tel: (+61) 0500 83 82 81 | Fax: (+61) 0500 83 82 82
http://www.rhyme.com.au
PGP key available upon request, and from pgp5.ai.mit.edu:11371

#15 Noname
JanWieck@t-online.de
In reply to: Philip Warner (#14)
Re: update on TOAST status'

Philip Warner wrote:

At 14:02 11/07/00 +0200, Jan Wieck wrote:

AFAICS, we need to always detoast values for index inserts,
and some day have another toaster inside the index access
methods.

We might not need it... at least not in the first pass.

The thing is actually broken and needs a fix. As soon as
"text" is toastable, it can happen everywhere that text is
toasted even if its actual plain value would perfectly fit
into an index tuple. Think of a table with 20 text columns,
where the indexed one holds a 1024-byte value, while all
others hold 512 bytes. In that case, the indexed one is the
biggest and gets toasted first. And if all the data is of
such a nature that compression doesn't gain enough, it might
still be the biggest one after that step and will be
considered for move-off ... boom.

We can't let this in in the first pass!
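The scenario above can be put in a hypothetical Python sketch (the 50%-compression assumption and the largest-first selection rule are simplifications of mine, not the real toaster heuristics): the indexed 1024-byte column is the first candidate, and may be picked again after compression if it is still the biggest:

```python
# Sketch of the failure mode described above: the toaster picks the largest
# varlena attribute first, so with 20 text columns the 1024-byte indexed one
# is compressed (and possibly moved off) before any 512-byte column.
# Purely illustrative; not the actual selection heuristics.
def pick_toast_candidates(attrs, tuple_limit):
    """Yield attribute names, largest first, until the tuple would fit."""
    sizes = dict(attrs)
    while sum(sizes.values()) > tuple_limit and sizes:
        victim = max(sizes, key=sizes.get)  # biggest attribute goes first
        yield victim
        sizes[victim] = sizes[victim] // 2  # assume ~50% compression

# The table from the example: one indexed 1024-byte column, 19 of 512 bytes.
attrs = [("indexed_col", 1024)] + [(f"col{i}", 512) for i in range(19)]
```

Running this against an 8K tuple limit picks `indexed_col` first, and again on the next pass once compression has only brought it down to the size of its neighbours.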

In the meantime we should decide on a safe maximum
index tuple size and emit an explanatory error message on the
attempt to insert oversized index entries, instead of
possibly corrupting the index.

Can I suggest that we also put out a warning when defining an index using a
field with a (potentially) unlimited size? Indexing a text field will
mostly be a bizarre thing to do, but, eg, indexing the first 255 chars of a
text field (via substr) might not be.

Marking it BOLD somewhere in the release notes, the CREATE
INDEX doc and some other places should be enough. Such a
message at every CREATE INDEX is annoying.

Jan


#16 Philip Warner
pjw@rhyme.com.au
In reply to: Noname (#15)
Re: update on TOAST status'

At 14:38 11/07/00 +0200, Jan Wieck wrote:

Can I suggest that we also put out a warning when defining an index using a
field with a (potentially) unlimited size? Indexing a text field will
mostly be a bizarre thing to do, but, e.g., indexing the first 255 chars of a
text field (via substr) might not be.

Marking it BOLD somewhere in the release notes, the CREATE
INDEX doc and some other places should be enough. Such a
message at every CREATE INDEX is annoying.

The suggestion was only if the index contained a text, lztext, etc. field,
but no problem. The way I read your suggestion was that I'd get a real
error when doing an insert if the text was too large.


#17 Noname
JanWieck@t-online.de
In reply to: Philip Warner (#16)
Re: update on TOAST status'

Philip Warner wrote:

At 14:38 11/07/00 +0200, Jan Wieck wrote:

Can I suggest that we also put out a warning when defining an index using a
field with a (potentially) unlimited size? Indexing a text field will
mostly be a bizarre thing to do, but, e.g., indexing the first 255 chars of a
text field (via substr) might not be.

Marking it BOLD somewhere in the release notes, the CREATE
INDEX doc and some other places should be enough. Such a
message at every CREATE INDEX is annoying.

The suggestion was only if the index contained a text, lztext, etc. field,
but no problem. The way I read your suggestion was that I'd get a real
error when doing an insert if the text was too large.

Yes, that's what I'm after. IMHO it's too fragile to check,
for multi-column indices with char(n) or so, whether the
resulting index tuples will fit in the future.

The atttypmod field on NUMERIC columns, for example, doesn't
easily tell how big the internal representation might grow.
And what about variable-size user-defined types that are
marked toastable? Can you estimate the maximum internal
storage size for them?

Jan


#18 Philip Warner
pjw@rhyme.com.au
In reply to: Noname (#17)
Re: update on TOAST status'

At 15:08 11/07/00 +0200, Jan Wieck wrote:

The atttypmod field on NUMERIC columns, for example, doesn't
easily tell how big the internal representation might grow.
And what about variable-size user-defined types that are
marked toastable? Can you estimate the maximum internal
storage size for them?

Well, uncompressed size would be a good upper estimate, since you may be
passed already compressed data...


#19 Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Philip Warner (#18)
AW: update on TOAST status'

It appears that btree indices (at least) can keep references
to old toast values that survive a VACUUM! Seems these
references live in nodes that no longer refer to a heap
tuple, but are used during tree traversal in comparisons. As
if an index tuple delete from a btree does not necessarily
cause the index value to disappear from the btree completely.
It'll never be returned by an index scan, but the value is
still there somewhere.

Would it be possible to actually delete those entries during vacuum?
I guess that would be an overall win, no?

Andreas

#20 Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Zeugswetter Andreas SB (#19)
AW: update on TOAST status'

Philip Warner wrote:

At 14:02 11/07/00 +0200, Jan Wieck wrote:

AFAICS, we need to always detoast values for index inserts,
and some day have another toaster inside the index access
methods.

We might not need it... at least not in the first pass.

The thing is actually broken and needs a fix. As soon as
"text" is toastable, it can happen everywhere that text is
toasted even if its actual plain value would perfectly fit
into an index tuple. Think of a table with 20 text columns,
where the indexed one holds a 1024-byte value, while all
others hold 512 bytes. In that case, the indexed one is the
biggest and gets toasted first. And if all the data is of
such a nature that compression doesn't gain enough, it might
still be the biggest one after that step and will be
considered for move-off ... boom.

We can't let this in in the first pass!

Have you added a minimum size for a value to actually be
considered for toasting? Imho some lower bound between 64 and
256 bytes per value would be useful, not only the row size.

Imho the same logic should apply to choosing whether a fixed
(max) size column qualifies for toasting (e.g. varchar(32)
should never qualify for toasting).

Andreas

#21 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Zeugswetter Andreas SB (#19)
Re: AW: update on TOAST status'

It appears that btree indices (at least) can keep references
to old toast values that survive a VACUUM! Seems these
references live in nodes that no longer refer to a heap
tuple, but are used during tree traversal in comparisons. As
if an index tuple delete from a btree does not necessarily
cause the index value to disappear from the btree completely.
It'll never be returned by an index scan, but the value is
still there somewhere.

Would it be possible to actually delete those entries during vacuum?
I guess that would be an overall win, no?

Seems that is the only good solution, or somehow link vacuuming of TOAST
tables to the indexes so these TOAST values are not removed.

#22 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#13)
Re: update on TOAST status'

JanWieck@t-online.de (Jan Wieck) writes:

After debugging something I thought was a bug in the toaster,
I've found something that really causes headaches.
TOAST AS IS IS NOT CAPABLE OF HOLDING INDEXED VALUES!
It appears that btree indices (at least) can keep references
to old toast values that survive a VACUUM! Seems these
references live in nodes that no longer refer to a heap
tuple, but are used during tree traversal in comparisons. As
if an index tuple delete from a btree does not necessarily
cause the index value to disappear from the btree completely.
It'll never be returned by an index scan, but the value is
still there somewhere.

Oooh, nasty. Probably the keys you are looking at are in upper-
level btree pages and indicate the ranges of keys found in lower
pages, rather than being pointers to real tuples.

One answer is to rebuild indexes from scratch during VACUUM,
before we vacuum the TOAST relation. We've been talking about
doing that for a long time. Maybe it's time to bite the bullet
and do it. (Of course that means fixing the relation-versioning
problem, which it seems we don't have a consensus on yet...)

Seems the designs of btree and toast are colliding. As soon
as "text" is toastable, this'll hurt - be warned.

Text *is* marked toastable in current CVS...

AFAICS, we need to always detoast values for index inserts,
and some day have another toaster inside the index access
methods. In the meantime we should decide on a safe maximum
index tuple size and emit an explanatory error message on the
attempt to insert oversized index entries, instead of
possibly corrupting the index.

I don't like that --- seems it would put a definite crimp in the
whole point of TOAST, which is not to have arbitrary limits on field
sizes.

regards, tom lane

#23Noname
JanWieck@t-online.de
In reply to: Zeugswetter Andreas SB (#19)
Re: AW: update on TOAST status

Zeugswetter Andreas SB wrote:

It appears that btree indices (at least) can keep references
to old toast values that survive a VACUUM! These references
seem to live in nodes that no longer refer to a heap tuple,
but are still used for comparisons during tree traversal. It
is as if deleting an index tuple from a btree does not
necessarily make the index value disappear from the btree
completely. It'll never be returned by an index scan, but the
value is still there somewhere.

Would it be possible to actually delete those entries during vacuum ?
I guess that would be an overall win, no ?

Seems I explained it a little confusingly, or am confused by
it myself.

Either way, VACUUM does DELETE those from the indices! But
btree is a balanced tree, and ISTM that it sometimes decides
to keep a deleted node just to keep the trees balanced and to
use it when deciding whether to go left or right. An index
scan will never return those nodes, but exactly at the time
btree needs to decide left/right, it calls the type-specific
CMP function, and that in turn invokes the toast fetch.

A pure btree has no need for this, but we're using a
high-concurrency optimized version called nbtree. That one
seems to do so.
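A toy model of what Jan describes might look like this (illustrative Python, not the actual nbtree code): internal pages hold separator keys that are copies of keys once inserted into a leaf, and deleting the leaf entry does not touch the separator, so later descents still run the comparison function against the deleted key.

```python
# Toy two-level btree: a root holding one separator key, two leaf pages
# holding the actual entries. The separator is a *copy* of a key that once
# lived in a leaf; deleting the leaf entry leaves the separator behind.

class ToyBTree:
    def __init__(self, keys):
        keys = sorted(keys)
        mid = len(keys) // 2
        self.leaves = [keys[:mid], keys[mid:]]
        self.separator = self.leaves[1][0]  # smallest key of the right leaf
        self.compared = []                  # every key fed to a comparison

    def _cmp(self, a, b):
        # stand-in for the type-specific CMP function (e.g. lztext_cmp);
        # with TOAST, comparing a moved-off key triggers a toast fetch
        self.compared.append(a)
        return (a > b) - (a < b)

    def search(self, key):
        # descend: compare against the separator to pick a leaf
        leaf = self.leaves[0] if self._cmp(self.separator, key) > 0 else self.leaves[1]
        return key in leaf

    def delete(self, key):
        # delete from the leaf only; the separator copy is NOT updated
        for leaf in self.leaves:
            if key in leaf:
                leaf.remove(key)

tree = ToyBTree(["apple", "kiwi", "mango", "pear"])
tree.delete("mango")        # the separator happens to be "mango"
tree.compared.clear()
tree.search("pear")         # descent still compares against "mango"
assert "mango" in tree.compared
```

If "mango" were a toast reference and VACUUM had already removed the referenced chunks, that comparison could no longer be resolved - which is exactly the headache described above.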

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#24Noname
JanWieck@t-online.de
In reply to: Zeugswetter Andreas SB (#20)
Re: AW: update on TOAST status

Zeugswetter Andreas SB wrote:

Philip Warner wrote:

At 14:02 11/07/00 +0200, Jan Wieck wrote:

AFAICS, we need to detoast values for index inserts allways
and have another toaster inside the index access methods at
some day.

We might not need it...at least not in the first pass.

The thing is actually broken and needs a fix. As soon as
"text" is toastable, it can happen everywhere that text gets
toasted even if its actual plain value would perfectly fit
into an index tuple. Think of a table with 20 text columns,
where the indexed one holds a 1024-byte value while all the
others hold 512 bytes. In that case, the indexed one is the
biggest and gets toasted first. And if all the data is of a
nature that compression doesn't gain enough, it might still
be the biggest one after that step and will be considered for
move-off ... boom.

We can't let this in in the first pass!

Have you added a minimum size for a value to actually be considered
for toasting ? Imho some lower border between 64 and 256 bytes per value
would be useful, not only the row size.

Imho the same logic should apply to choose whether a fixed (max) size
column qualifies for toasting (e.g. varchar(32) should never qualify
for toasting).

The toaster simply tries to keep all heap tuples smaller than
2K (MaxTupleSize / 4). It does so by first compressing all
compressible ones, then moving off. Both steps always pick
the attributes in descending size order.
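That strategy can be sketched roughly like this (illustrative Python; the 2K target and the largest-first ordering follow the description above, while the names and the pointer placeholder are made up):

```python
import random
import zlib

TARGET = 8192 // 4                  # MaxTupleSize / 4 with 8K blocks
POINTER = b"<toast-pointer>"        # stand-in for the small on-disk reference

def toast_tuple(attrs):
    """attrs: dict of column name -> bytes. Shrink the tuple below TARGET,
    first by compressing, then by moving values off, always largest-first."""
    moved_off = {}
    size = lambda: sum(len(v) for v in attrs.values())

    # Pass 1: compress attributes, biggest first
    tried = set()
    while size() >= TARGET:
        todo = [n for n in attrs if n not in tried]
        if not todo:
            break
        name = max(todo, key=lambda n: len(attrs[n]))
        tried.add(name)
        c = zlib.compress(attrs[name])
        if len(c) < len(attrs[name]):
            attrs[name] = c

    # Pass 2: move off attributes, biggest first
    while size() >= TARGET:
        todo = [n for n in attrs if attrs[n] is not POINTER]
        if not todo:
            break
        name = max(todo, key=lambda n: len(attrs[n]))
        moved_off[name] = attrs[name]
        attrs[name] = POINTER

    return attrs, moved_off

# The scenario described above: 20 text columns, the indexed one the
# biggest, all values incompressible - the indexed column is moved off first.
rnd = random.Random(0)
cols = {f"c{i}": rnd.randbytes(512) for i in range(19)}
cols["indexed"] = rnd.randbytes(1024)
plain, external = toast_tuple(cols)
assert "indexed" in external
assert sum(len(v) for v in plain.values()) < TARGET
```

Note what goes wrong: whether a column is indexed plays no role in the ordering, so the indexed value is the very first candidate for move-off.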

It does not know anything about indices (and that wouldn't
help either because indices could be created later as done in
every dump).

Jan


#25Noname
JanWieck@t-online.de
In reply to: Tom Lane (#22)
Re: update on TOAST status

Tom Lane wrote:

JanWieck@t-online.de (Jan Wieck) writes:

After debugging something I thought was a bug in the toaster,
I've found something really causing headaches.
TOAST AS IS IS NOT CAPABLE OF HOLDING INDEXED VALUES!
It appears that btree indices (at least) can keep references
to old toast values that survive a VACUUM! These references
seem to live in nodes that no longer refer to a heap tuple,
but are still used for comparisons during tree traversal. It
is as if deleting an index tuple from a btree does not
necessarily make the index value disappear from the btree
completely. It'll never be returned by an index scan, but the
value is still there somewhere.

Oooh, nasty. Probably the keys you are looking at are in upper-
level btree pages and indicate the ranges of keys found in lower
pages, rather than being pointers to real tuples.

So our btree implementation is closer to an ISAM file
organization than to a real tree? Anyway, either one or the
other is the reason that an attempt to insert a new value
results in an lztext_cmp() call that cannot be resolved due
to a missing toast value.

I added some checks to the detoaster just to throw an
elog(ERROR) instead of a coredump in such a case earlier
today.

One answer is to rebuild indexes from scratch during VACUUM,
before we vacuum the TOAST relation. We've been talking about
doing that for a long time. Maybe it's time to bite the bullet
and do it. (Of course that means fixing the relation-versioning
problem, which it seems we don't have a consensus on yet...)

Doesn't matter if we do it before or after, because the main
heap shouldn't contain any more toast references to deleted
(later to be vacuumed) toast entries at that time.

Anyway, it's a nice idea that should solve the problem. For
indices, which can always be rebuilt from the heap data, I
don't see such a big need for the versioning. Only that a
partially rebuilt index (rebuild crashed in the middle) needs
another vacuum before the DB is accessible again. How often
does that happen?

So why not have vacuum truncate the index file to zero and
rebuild it from scratch in place? Can anyone access an index
while vacuum holds a lock on its heap?

Seems the designs of btree and toast are colliding. As soon
as "text" is toastable, this'll hurt - be warned.

Text *is* marked toastable in current CVS...

Whoa - hadn't noticed.

Will run my tests against text ... in parallel. Does it have
any impact on the regression test execution time? Does any
toast table (which should now be there in the regression DB)
lose its zero size during the tests?

AFAICS, we need to always detoast values for index inserts,
and some day have another toaster inside the index access
methods. In the meantime we should decide on a safe maximum
index tuple size and emit an explanatory error message on any
attempt to insert an oversized index entry, instead of
possibly corrupting the index.

I don't like that --- seems it would put a definite crimp in the
whole point of TOAST, which is not to have arbitrary limits on field
sizes.

If we can solve it, let's do so. If we cannot, let's restrict
it for 7.1.

Jan


#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#25)
Re: update on TOAST status

JanWieck@t-online.de (Jan Wieck) writes:

Tom Lane wrote:

One answer is to rebuild indexes from scratch during VACUUM,
before we vacuum the TOAST relation. We've been talking about
doing that for a long time. Maybe it's time to bite the bullet
and do it. (Of course that means fixing the relation-versioning
problem, which it seems we don't have a consensus on yet...)

Doesn't matter if we do it before or after, because the main
heap shouldn't contain any more toast references to deleted
(later to be vacuumed) toast entries at that time.

No, we must fix the indexes first, so that they contain no bogus
values if we fail while vacuuming the TOAST relation.

Anyway, it's a nice idea that should solve the problem. For
indices, which can always be rebuilt from the heap data, I
don't see such a big need for the versioning. Only that a
partially rebuilt index (rebuild crashed in the middle) needs
another vacuum before the DB is accessible again. How often
does that happen?

If it happens just once on one of your system-table indices, you
won't be happy. We've sweated hard to make VACUUM crash-safe,
and I don't want to throw that away because of TOAST.

Text *is* marked toastable in current CVS...

Whoa - hadn't noticed.

Will run my tests against text ... in parallel. Does it have
any impact on the regression test execution time? Does any
toast table (which should now be there in the regression DB)
lose its zero size during the tests?

Yes, there are some nonzero-size toast files in there. Haven't
tried to run any timing tests...

regards, tom lane

#27Noname
JanWieck@t-online.de
In reply to: Tom Lane (#26)
Re: update on TOAST status

Tom Lane wrote:

JanWieck@t-online.de (Jan Wieck) writes:

Tom Lane wrote:

One answer is to rebuild indexes from scratch during VACUUM,
before we vacuum the TOAST relation. We've been talking about
doing that for a long time. Maybe it's time to bite the bullet
and do it. (Of course that means fixing the relation-versioning
problem, which it seems we don't have a consensus on yet...)

Doesn't matter if we do it before or after, because the main
heap shouldn't contain any more toast references to deleted
(later to be vacuumed) toast entries at that time.

No, we must fix the indexes first, so that they contain no bogus
values if we fail while vacuuming the TOAST relation.

Got me.

Anyway, it's a nice idea that should solve the problem. For
indices, which can always be rebuilt from the heap data, I
don't see such a big need for the versioning. Only that a
partially rebuilt index (rebuild crashed in the middle) needs
another vacuum before the DB is accessible again. How often
does that happen?

If it happens just once on one of your system-table indices, you
won't be happy. We've sweated hard to make VACUUM crash-safe,
and I don't want to throw that away because of TOAST.

Alternatively we could go for both methods. Does any system
catalog have an index on a varlena field? If not, we could do
the classic vacuum on anything that is either a catalog or a
table that doesn't have a toast relation. Then do the lazy
reindex from scratch on anything left.

Text *is* marked toastable in current CVS...

Whoa - hadn't noticed.

Will run my tests against text ... in parallel. Does it have
any impact on the regression test execution time? Does any
toast table (which should now be there in the regression DB)
lose its zero size during the tests?

Yes, there are some nonzero-size toast files in there. Haven't
tried to run any timing tests...

No, there aren't. All you've seen are their indices of 16K
each. But my tests, formerly using lztext, ran smoothly with
text.

I've looked at textout() and, well, your style of detoasting
arguments looks a lot better and easier. From the way it's
implemented I assume the per-tuple memory context is done
too, no?

Jan


#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#27)
Re: update on TOAST status

JanWieck@t-online.de (Jan Wieck) writes:

I've looked at textout() and, well, your style of detoasting
arguments looks a lot better and easier. From the way it's
implemented I assume the per-tuple memory context is done
too, no?

Not yet --- I'm running regress tests on it right now, though.
You're right that I'm assuming the function routines can leak
memory without trouble.

(We might need to avoid leaks in the comparison routines that are used
for indexes, but otherwise I think this scheme will work comfortably.)

regards, tom lane

#29Noname
JanWieck@t-online.de
In reply to: Tom Lane (#28)
Re: update on TOAST status

Tom Lane wrote:

JanWieck@t-online.de (Jan Wieck) writes:

I've looked at textout() and, well, your style of detoasting
arguments looks a lot better and easier. From the way it's
implemented I assume the per-tuple memory context is done
too, no?

Not yet --- I'm running regress tests on it right now, though.
You're right that I'm assuming the function routines can leak
memory without trouble.

(We might need to avoid leaks in the comparison routines that are used
for indexes, but otherwise I think this scheme will work comfortably.)

That sounds bad. At least not very good.

So we'd better add a PG_FREEARG_xxx(ptr, argno) macro that
does the pfree only if the pointer differs from the one in
the argument.
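In Python terms the proposed macro amounts to an identity check (illustrative sketch, not PostgreSQL code; the idea is essentially what later became the PG_FREE_IF_COPY macro in fmgr.h):

```python
# Detoasting hands back either the original datum (if it was plain) or a
# freshly allocated copy (if it had to be decompressed/fetched). Only the
# copy must be freed - hence the pointer comparison.

freed = []

def pfree(obj):
    freed.append(obj)               # stand-in for the real deallocator

def detoast(datum):
    # toasted datums modelled as ("toasted", payload); plain ones as bytes
    if isinstance(datum, tuple) and datum[0] == "toasted":
        return datum[1]             # a new object: caller must free it
    return datum                    # the very same object: nothing to free

def free_if_copy(ptr, original_arg):
    if ptr is not original_arg:     # the proposed PG_FREEARG_xxx test
        pfree(ptr)

plain = b"short value"
toasted = ("toasted", b"x" * 10_000)

p = detoast(plain)
free_if_copy(p, plain)              # no-op: p *is* the argument
assert freed == []

t = detoast(toasted)
free_if_copy(t, toasted)            # frees the detoasted copy
assert freed == [b"x" * 10_000]
```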

Jan


#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#29)
Re: update on TOAST status

JanWieck@t-online.de (Jan Wieck) writes:

(We might need to avoid leaks in the comparison routines that are used
for indexes, but otherwise I think this scheme will work comfortably.)

That sounds bad. At least not very good.

So we'd better add a PG_FREEARG_xxx(ptr, argno) macro that
does the pfree only if the pointer differs from the one in
the argument.

Yes, I already borrowed that idea from your original code. I don't
like it a whole lot, but as long as the need for it is confined to
the indexable comparison operators I think we can tolerate it.

The alternative is to hack up the index search routines (and also
tuplesort.c, and perhaps other places?) to maintain a short-term memory
context for evaluating comparison operators, and reset said context
fairly frequently. That might be doable but I haven't yet looked into
what it would take.

I'm hoping to commit what I have this evening...

regards, tom lane

#31Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Tom Lane (#30)
AW: update on TOAST status

I don't like that --- seems it would put a definite crimp in the
whole point of TOAST, which is not to have arbitrary limits on field
sizes.

If we can solve it, let's do so. If we cannot, let's restrict
it for 7.1.

How are you doing the index toasting currently ? Is it along the same
lines as table toasting ? That is: toast some index column values if the key
exceeds 2k ?

Andreas

#32Noname
JanWieck@t-online.de
In reply to: Zeugswetter Andreas SB (#31)
Re: AW: update on TOAST status

Zeugswetter Andreas SB wrote:

I don't like that --- seems it would put a definite crimp in the
whole point of TOAST, which is not to have arbitrary limits on field
sizes.

If we can solve it, let's do so. If we cannot, let's restrict
it for 7.1.

How are you doing the index toasting currently ? Is it along the same
lines as table toasting ? That is: toast some index column values if the key
exceeds 2k ?

The current CVS is broken in that area. You'll notice as soon
as you have many huge "text" values in an index, update them,
vacuum and continue to update.

The actual behaviour of the toaster is to toast each tuple
until it has a delicious-looking, brown and crispy surface.
The indicator for being delicious is that it shrank below
MaxTupleSize/4 - that's a little less than 2K in a default 8K
blocksize setup.

It then sticks the new tuple into the HeapTuple's t_data
pointer.

Index inserts are always done after heap_insert() or
heap_update(). At that time, the index tuples are built from
the values found in the now-replaced heap tuple. And since
the heap tuple found there is always smaller than 2K, any
combination of attributes out of it must be too (it's
impossible to specify one and the same attribute multiple
times in one index).
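Put as a sketch (illustrative Python with invented field names, not the actual index AM code): the index tuple simply copies the datums as stored in the already-toasted heap tuple, so no index tuple over distinct attributes can exceed the post-toast tuple size.

```python
TARGET = 8192 // 4      # the ~2K bound the toaster guarantees per heap tuple

# A heap tuple as it sits on disk after toasting: every datum is already
# plain-but-small, compressed, or a toast pointer.
stored_heap_tuple = {
    "id": b"\x00\x00\x00\x07",
    "title": b"<compressed title>".ljust(300, b"."),
    "body": b"<toast-pointer>",
}
assert sum(len(v) for v in stored_heap_tuple.values()) < TARGET

def form_index_tuple(heap_tuple, indexed_cols):
    # heap_insert()/heap_update() have already run; the index tuple simply
    # copies the stored datums, inheriting compression and toast pointers
    return {c: heap_tuple[c] for c in indexed_cols}

itup = form_index_tuple(stored_heap_tuple, ["title", "body"])
# a subset of the tuple's attributes can never be bigger than the whole
assert sum(len(v) for v in itup.values()) < TARGET
assert itup["body"] == b"<toast-pointer>"   # reference inherited verbatim
```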

So the indices simply inherit the toasting result. If a value
got compressed, the index will store the compressed format.
If it got moved off, the index will hold the toast entry
reference for it.

One of the biggest advantages is this: In the old system, an
indexed column of 2K caused 2K be stored in the heap plus 2K
stored in the index. Plus all the 2K instances in upper index
block range specs. Now, the heap and the index will only
hold references or compressed items.

Absolutely no problem for compressed items. All information
to recreate the original value is in the Datum itself.

For externally stored ones, the reference tells the OIDs of
the secondary relation and its index (where to find the data
of this entry), a unique identifier of the item (another OID)
and some other info. So the reference contains all the
information required to fetch the data just by looking at the
reference. And since the detoaster scans the secondary
relation with a visibility of SnapshotAny, it will succeed in
finding the entries even if they've been deleted long ago by
another committed transaction. So index traversal will
succeed on that in any case.
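A sketch of that reference and the SnapshotAny behaviour (illustrative Python; the field names are invented, not the actual on-disk layout):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToastPointer:
    toastrel_oid: int   # OID of the secondary (toast) relation
    toastidx_oid: int   # OID of that relation's index
    value_oid: int      # unique identifier of the stored value
    raw_size: int       # original size - the "some other info"

# The toast relation, modelled as chunk lists plus a deleted flag per value.
toast_rel = {
    7001: {"chunks": [b"Hello, ", b"TOAST!"], "deleted": True},
}

def fetch(ptr, snapshot="SnapshotAny"):
    entry = toast_rel[ptr.value_oid]
    if snapshot != "SnapshotAny" and entry["deleted"]:
        raise LookupError("not visible")    # an MVCC snapshot would miss it
    return b"".join(entry["chunks"])

ptr = ToastPointer(toastrel_oid=16384, toastidx_oid=16385,
                   value_oid=7001, raw_size=13)
# SnapshotAny still sees a row another transaction deleted long ago:
assert fetch(ptr) == b"Hello, TOAST!"

# ...until VACUUM physically removes the chunks. Then even SnapshotAny
# can't help - the dangling-upper-level-key failure described below:
del toast_rel[7001]
try:
    fetch(ptr)
except KeyError:
    pass
```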

What I didn't know at the time of implementation is that
btree indices can keep such a reference in upper-level block
range specifications even after a vacuum has successfully
deleted the index tuple holding the reference itself. That's
the current pity.

Thus, if vacuum finally removed deleted tuples from the
secondary relations (after the heap and index have been
vacuumed), the detoaster cannot find those entries,
referenced by upper index blocks, any more.

Maybe we could propagate key range changes into upper blocks
at index_delete() time. Will look at the btree code now.

Jan


#33Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Noname (#32)
AW: AW: update on TOAST status

The actual behaviour of the toaster is to toast each tuple
until it has a delicious-looking, brown and crispy surface.
The indicator for being delicious is that it shrank below
MaxTupleSize/4 - that's a little less than 2K in a default 8K
blocksize setup.

So the indices simply inherit the toasting result. If a value
got compressed, the index will store the compressed format.
If it got moved off, the index will hold the toast entry
reference for it.

Ok, this is where I would probably rethink the behavior.
Would it be possible to choose which columns need to
stay "moved off" in the index on the basis that the key size stays
below page size / 4 ? Thus if a key fits inside the 2k
you don't store the reference, but the compressed values
(even if they stay moved off for the heap table).

The timings you did only involved heap tuples, not index
tuples. My guess would be that there is a similar tradeoff in
the index. Fetching toast values for header pages in an index
seems like a very expensive operation, because it needs to be
performed on every index access even if the searched-for
value is not toasted but falls into this range.

Of course this does not solve the "toast value for an already
deleted key" problem, but it would lower the probability somewhat.

Andreas

#34Noname
JanWieck@t-online.de
In reply to: Noname (#32)
Re: AW: update on TOAST status

I wrote:

Maybe we could propagate key range changes into upper blocks
at index_delete() time. Will look at the btree code now.

After looking at the vacuum code it doesn't seem to be a good
idea. Doing so would require traversing the btree first,
while the current implementation just grabs the block by
index ctid and pulls out the tuple. I would expect it to
significantly slow down vacuum again - which none of us
wants.

So the only way left is recreating the indices from scratch
and moving the new ones into place.

But in contrast to things like column dropping, this would
have to happen on every vacuum run for a lot of tables.

Isn't it appropriate to have a specialized version of it for
this case instead of waiting for a general relation
versioning?

Jan


#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#34)
Re: AW: update on TOAST status

JanWieck@t-online.de (Jan Wieck) writes:

So the only way left is recreating the indices from scratch
and moving the new ones into place.
But in contrast to things like column dropping, this would
have to happen on every vacuum run for a lot of tables.
Isn't it appropriate to have a specialized version of it for
this case instead of waiting for a general relation
versioning?

I don't see a "specialized" way that would be any different in
performance from a "generalized" solution. The hard part AFAICT is how
does a newly-started backend discover the current version numbers for
the critical system tables and indexes. To do versioning of system
indexes at all, we need a full-fledged solution.

But as you pointed out before, none of the system indexes are on
toastable datatypes. (I just checked --- the only index opclasses used
in template1 are: int2_ops int4_ops oid_ops char_ops oidvector_ops
name_ops.) Maybe we could have an interim solution using the old method
for system indexes and a drop-and-rebuild approach for user indexes.
A crash partway through rebuild would leave you with a busted index,
but maybe WAL could take care of redoing the index build after restart.
(Of course, if the index build failure is reproducible, you're in
big trouble...)

I don't *like* that approach a whole lot; it's ugly and doesn't sound
all that reliable. But if we don't want to deal with relation
versioning for 7.1, maybe it's the only way for now.

regards, tom lane