update on TOAST status
FYI,
during the day I committed a couple of changes to TOAST.
- Secondary relations for the toaster (to move off values)
are now automatically created during CREATE TABLE, ALTER
TABLE ... ADD COLUMN and SELECT ... INTO, whenever the
first toastable attribute appears in the table schema.
- The TOAST tables are now of kind RELKIND_TOASTVALUE.
- TOAST tables cannot be vacuumed separately. They are
always vacuumed if their master table is, while VACUUM
still holds the lock on the master table.
- VACUUM doesn't propagate ANALYZE to TOAST tables.
Statistics for them are needless because the toast access
is always hardcoded indexed.
- TOAST tables are protected against manual INSERT, UPDATE
and DELETE operations. SELECT is still possible for
debugging purposes. The name of the TOAST table is
pg_toast_<oid-of-master>.
- The chunk_data attribute has been changed to type bytea.
For now, "lztext" is the only test candidate datatype that
invokes the toaster. It can hold multi-megabyte values now.
But be warned, this datatype will disappear as soon as "text"
is toastable.
Next I'll make pg_dump TOAST-safe. Will only take a couple of
minutes I think.
Toast tables aren't automatically created for system
catalogs. Thus I'll add
ALTER TABLE pg_rewrite CREATE TOAST TABLE
to initdb. So we'll get unlimited view complexity for free.
As soon as arrays are toastable, we might want to add
pg_class because of relacl too.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
Jan Wieck wrote:
Toast tables aren't automatically created for system
catalogs. Thus I'll add
ALTER TABLE pg_rewrite CREATE TOAST TABLE
to initdb. So we'll get unlimited view complexity for free.
As soon as arrays are toastable, we might want to add
pg_class because of relacl too.
Why would we want system catalogs toastable?
Chris Bitmead wrote:
Jan Wieck wrote:
Toast tables aren't automatically created for system
catalogs. Thus I'll add
ALTER TABLE pg_rewrite CREATE TOAST TABLE
to initdb. So we'll get unlimited view complexity for free.
As soon as arrays are toastable, we might want to add
pg_class because of relacl too.
Why would we want system catalogs toastable?
I assume this will allow for Views with large rewrite rules which
currently are limited in size.
Mike Mascari
- VACUUM doesn't propagate ANALYZE to TOAST tables.
Statistics for them are needless because the toast access
is always hardcoded indexed.
I don't think statistics are insignificant for TOASTed columns. If I
say col=3, the optimizer uses that information for estimating the number
of rows returned, and figuring out the type of join and order of join to
perform, not just for "use index, don't use index" decisions.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote:
- VACUUM doesn't propagate ANALYZE to TOAST tables.
Statistics for them are needless because the toast access
is always hardcoded indexed.
I don't think statistics are insignificant for TOASTed columns. If I
say col=3, the optimizer uses that information for estimating the number
of rows returned, and figuring out the type of join and order of join to
perform, not just for "use index, don't use index" decisions.
Ask your boys to give you a training session for "reading"
when they go to bed tonight - and greet them from the "police
officer" :-)
I said "to TOAST tables", not "TOASTed columns".
Their master tables will always have the statistics,
including those for toasted columns, if you ask for them via
ANALYZE.
In normal operation, no one would ever know if a TOAST table
is accessed during his query - not even the planner or
optimizer. It's totally transparent and the only one
accessing the TOAST tables is the toaster himself - and he
knows what he does.
Jan
Mike Mascari wrote:
Chris Bitmead wrote:
Jan Wieck wrote:
Toast tables aren't automatically created for system
catalogs. Thus I'll add
ALTER TABLE pg_rewrite CREATE TOAST TABLE
to initdb. So we'll get unlimited view complexity for free.
As soon as arrays are toastable, we might want to add
pg_class because of relacl too.
Why would we want system catalogs toastable?
I assume this will allow for Views with large rewrite rules which
currently are limited in size.
Absolutely correct.
With the code in place (after a few more fixes) I was able to
create a "SELECT *" view from a 681 attribute table. The
resulting rule is about 170K! And more complex things are
possible too now, because the rewrite rule size is not
limited any longer (as long as you have enough CPU, ram and
disk space).
Jan
Bruce Momjian <pgman@candle.pha.pa.us> writes:
- VACUUM doesn't propagate ANALYZE to TOAST tables.
Statistics for them are needless because the toast access
is always hardcoded indexed.
I don't think statistics are insignificant for TOASTed columns.
He didn't say that! I think what he meant is that there's no need for
statistics associated with the TOAST table itself, and AFAICS that's OK.
BTW, I have thought of a potential problem with indexes on toasted
columns. As I understand Jan's current thinking, the idea is
1. During storage of the tuple in the main table, any oversize fields
get compressed/moved off.
2. The toasted item in the finished main tuple gets handed to the index
routines to be stored in the index.
Now, storing the toasted item in the index tuple seems fine, but what
I do not like here is the implication that all the comparisons needed
to find where to *put* the index tuple are done using a pretoasted
value. That seems to imply dozens of redundant decompressions/fetches,
another one for each key comparison we have to do.
Jan, do you have a way around this that I missed?
One simple answer that might help for other scenarios too is to keep
a small cache of the last few values that had to be untoasted. Maybe
we only need it for moved-off values --- it could be that decompression
is fast enough that we should just do it over rather than trying to
cache.
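Such a small cache of recently detoasted values could look like the following sketch: a fixed-capacity LRU keyed by the toast reference. The names and structure here are purely illustrative, not the backend's actual interface.

```python
from collections import OrderedDict

class DetoastCache:
    """Tiny LRU cache for recently detoasted values (illustrative only)."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = OrderedDict()  # toast reference -> plain (detoasted) value
        self.hits = self.misses = 0

    def fetch(self, toast_oid, detoast):
        # On a hit, move the entry to the MRU end and skip detoasting.
        if toast_oid in self.entries:
            self.entries.move_to_end(toast_oid)
            self.hits += 1
            return self.entries[toast_oid]
        # On a miss, detoast (fetch chunks / decompress) and cache the result.
        self.misses += 1
        value = detoast(toast_oid)
        self.entries[toast_oid] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the LRU entry
        return value
```

Repeated comparisons against the same key value, as during a btree descent, would then pay for the detoast only once.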
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
- VACUUM doesn't propagate ANALYZE to TOAST tables.
Statistics for them are needless because the toast access
is always hardcoded indexed.
I don't think statistics are insignificant for TOASTed columns.
He didn't say that! I think what he meant is that there's no need for
statistics associated with the TOAST table itself, and AFAICS that's OK.
BTW, I have thought of a potential problem with indexes on toasted
columns. As I understand Jan's current thinking, the idea is
1. During storage of the tuple in the main table, any oversize fields
get compressed/moved off.
2. The toasted item in the finished main tuple gets handed to the index
routines to be stored in the index.
Right.
Now, storing the toasted item in the index tuple seems fine, but what
I do not like here is the implication that all the comparisons needed
to find where to *put* the index tuple are done using a pretoasted
value. That seems to imply dozens of redundant decompressions/fetches,
another one for each key comparison we have to do.
Dozens - right.
I just did a little gdb session tracing a
SELECT ... WHERE toasted = 'xxx'
The table has 151 rows and an index on 'toasted'. It needed 6
fetches of the attribute. Better than good, because 2^6 is
only 64, so btree did a perfect job. Anyhow, in the case of a
real TOASTed (read burned) value, it'd mean 6 index scans to
recreate the on disk stored representation plus 6
decompression loops to get the plain one to compare against.
What the hell would an "IN (SELECT ...)" cause?
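The effect traced above can be modeled roughly: each key comparison during a btree descent would have to detoast a stored value again, so the number of detoast operations grows with the depth of the search. A toy model (not the actual btree code):

```python
def search_with_detoast_count(sorted_keys, probe):
    """Binary-search a sorted key list, counting how often a stored
    key would have to be detoasted for a comparison (toy model)."""
    detoasts = 0
    lo, hi = 0, len(sorted_keys)
    while lo < hi:
        mid = (lo + hi) // 2
        detoasts += 1          # each comparison detoasts one stored value
        if sorted_keys[mid] < probe:
            lo = mid + 1
        else:
            hi = mid
    return lo, detoasts

# 151 rows, as in the gdb session above: a handful of detoasts per lookup.
pos, n = search_with_detoast_count(list(range(151)), 42)
```

For a moved-off value, each of those comparisons means another toast-table index scan plus a decompression loop.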
Jan, do you have a way around this that I missed?
One simple answer that might help for other scenarios too is to keep
a small cache of the last few values that had to be untoasted. Maybe
we only need it for moved-off values --- it could be that decompression
is fast enough that we should just do it over rather than trying to
cache.
I'm still arguing that indexing huge values is a hint for a
misleading schema. If this is true, propagating toasted
attributes into indices is a dead end street and I'd have to
change the heap-access<->toaster interface so that the
modified (stored) main tuple isn't visible to the following
code (that does the index inserts).
What is the value of supporting index tuples >2K? Support of
braindead schemas? I can live without it!
Jan
Bruce Momjian wrote:
- VACUUM doesn't propagate ANALYZE to TOAST tables.
Statistics for them are needless because the toast access
is always hardcoded indexed.
I don't think statistics are insignificant for TOASTed columns. If I
say col=3, the optimizer uses that information for estimating the number
of rows returned, and figuring out the type of join and order of join to
perform, not just for "use index, don't use index" decisions.
Ask your boys to give you a training session for "reading"
when they go to bed tonight - and greet them from the "police
officer" :-)
Sure.
I said "to TOAST tables", not "TOASTed columns".
Their master tables will always have the statistics,
including those for toasted columns, if you ask for them via
ANALYZE.
In normal operation, no one would ever know if a TOAST table
is accessed during his query - not even the planner or
optimizer. It's totally transparent and the only one
accessing the TOAST tables is the toaster himself - and he
knows what he does.
Oh, sure, got it. It is the toast table that doesn't need stats.
JanWieck@t-online.de (Jan Wieck) writes:
Tom Lane wrote:
One simple answer that might help for other scenarios too is to keep
a small cache of the last few values that had to be untoasted. Maybe
we only need it for moved-off values --- it could be that decompression
is fast enough that we should just do it over rather than trying to
cache.
I'm still arguing that indexing huge values is a hint for a
misleading schema. If this is true, propagating toasted
attributes into indices is a dead end street and I'd have to
change the heap-access<->toaster interface so that the
modified (stored) main tuple isn't visible to the following
code (that does the index inserts).
But you'll notice that is *not* what I suggested. A detoasted-value
cache could be useful in more situations than just an index lookup.
I don't necessarily say we've got to have it in 7.1, but let's keep
the idea in mind in case we start finding there is a bottleneck here.
What is the value of supporting index tuples >2K?
If you're toasting the whole main tuple down to <2K, you might find
yourself toasting individual fields that are a good bit less than
that. So I don't think indexing a toasted value will be all that
unusual.
But this is all speculation for now. Let's get it working bulletproof
for 7.1, and then worry about speedups after we know they are needed.
regards, tom lane
Tom Lane wrote:
JanWieck@t-online.de (Jan Wieck) writes:
Tom Lane wrote:
One simple answer that might help for other scenarios too is to keep
a small cache of the last few values that had to be untoasted. Maybe
we only need it for moved-off values --- it could be that decompression
is fast enough that we should just do it over rather than trying to
cache.
I'm still arguing that indexing huge values is a hint for a
misleading schema. If this is true, propagating toasted
attributes into indices is a dead end street and I'd have to
change the heap-access<->toaster interface so that the
modified (stored) main tuple isn't visible to the following
code (that does the index inserts).
But you'll notice that is *not* what I suggested. A detoasted-value
Haven't missed it in the first read - of course.
cache could be useful in more situations than just an index lookup.
I don't necessarily say we've got to have it in 7.1, but let's keep
the idea in mind in case we start finding there is a bottleneck here.
What is the value of supporting index tuples >2K?
If you're toasting the whole main tuple down to <2K, you might find
yourself toasting individual fields that are a good bit less than
that. So I don't think indexing a toasted value will be all that
unusual.
Exactly that's why I'm asking if we wouldn't be better off by
limiting index tuples to (blocksize - overhead) / 4 and
always storing plain, untoasted values in indices.
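For the default 8K block size, the limit proposed here works out to roughly 2K per index tuple. The overhead figure below is illustrative, not the exact page-header size:

```python
BLOCK_SIZE = 8192        # default PostgreSQL block size
PAGE_OVERHEAD = 192      # illustrative per-page overhead, not the exact figure

# Require that at least 4 index tuples fit on one page, as suggested above.
max_index_tuple_size = (BLOCK_SIZE - PAGE_OVERHEAD) // 4
print(max_index_tuple_size)   # 2000
```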
I've asked now a couple of times "who really has the need for
indexing huge values"? All responses I got so far were of
the kind "would be nice if we support it" or "I don't like
such restrictions". But no one really said "I need it".
But this is all speculation for now. Let's get it working bulletproof
for 7.1, and then worry about speedups after we know they are needed.
Let me speculate too a little.
The experience I have up to now is that the time saved by
requiring fewer blocks in the buffer cache outweighs the cost
of decompression. Especially with our algorithm, because it
is byte oriented (instead of Huffman coding being based on a
bit stream), causing it to be extremely fast on
decompression. And the technique of moving off values from
the main heap causes the main tuples to be much smaller. As
long as the toasted values aren't used in qualification or
joining, only their references move around through the
various executor steps, and only those values that are part
of the final result set need to be fetched when sending them
to the client.
Given a limited amount of total memory available for one
running postmaster, we save a lot of disk I/O and hold more
values in their compressed format in the shared buffers. With
the limit on total memory, the size of the buffer cache must
be lowered by the size of the new detoasted cache, and that
only if we make it shared too. Given further an average
compression ratio of 50% (which is not unlikely with typical
input like html pages), one cached detoasted value would
require two compressed ones to go away.
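The arithmetic behind that trade-off can be spelled out with a toy calculation. The sizes below are made up; only the 50% compression ratio comes from the text above:

```python
buffer_bytes = 1_000_000      # memory available for cached values (made up)
avg_plain_size = 4000         # average detoasted value size (made up)
compression_ratio = 0.5       # ~50% for typical input like html pages

compressed_size = int(avg_plain_size * compression_ratio)
values_if_compressed = buffer_bytes // compressed_size   # values held compressed
values_if_plain = buffer_bytes // avg_plain_size         # values held detoasted

# Caching one detoasted value costs the space of two compressed ones.
```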
Wouldn't really surprise me if we gain speed from it in the
average query, even if some operations might slow down
(sorting on possibly toasted fields).
We need to see some results and wait for reports on this.
But we know already that it can cause trouble with indexed
fields, because these are likely to be used for comparison
during scans. So do we want to have indices always storing
plain values and limit them in the index-tuple size or not?
Jan
JanWieck@t-online.de (Jan Wieck) writes:
... So do we want to have indices storing plain
values always and limit them in the index-tuple size or not?
I think not: it will be seen as a robustness failure, even (or
especially) if it doesn't happen often. I can see the bug reports now:
"Hey! I tried to insert a long value in my field, and it didn't work!
I thought you'd fixed this bug?"
You make good arguments that we shouldn't be too concerned about the
speed of access to toasted index values, and I'm willing to accept
that point of view (at least till we have hard evidence about it).
But when I say "it should be bulletproof" I mean it should *work*,
without imposing arbitrary limits on the user. Arbitrary limits are
exactly what we are trying to eliminate.
regards, tom lane
Tom Lane wrote:
JanWieck@t-online.de (Jan Wieck) writes:
... So do we want to have indices storing plain
values always and limit them in the index-tuple size or not?
I think not: it will be seen as a robustness failure, even (or
especially) if it doesn't happen often. I can see the bug reports now:
"Hey! I tried to insert a long value in my field, and it didn't work!
I thought you'd fixed this bug?"
You make good arguments that we shouldn't be too concerned about the
speed of access to toasted index values, and I'm willing to accept
that point of view (at least till we have hard evidence about it).
But when I say "it should be bulletproof" I mean it should *work*,
without imposing arbitrary limits on the user. Arbitrary limits are
exactly what we are trying to eliminate.
After debugging something I thought was a bug in the toaster,
I've found something really causing headaches.
TOAST AS IS IS NOT CAPABLE OF HOLDING INDEXED VALUES!
It appears that btree indices (at least) can keep references
to old toast values that survive a VACUUM! Seems these
references live in nodes not actually referring to a heap
tuple any more, but used during tree traversal in
comparisons. As if an index tuple delete from a btree does
not necessarily cause the index value to disappear from the
btree completely. It'll never be returned by an index scan,
but the value is still there somewhere.
Everything is OK with this up to a VACUUM run. The toaster
uses SnapShotAny to fetch toast values. So an external value
can be fetched by the toaster even if it is already deleted
and committed. If he has a reference somewhere, he always
has a share or higher lock on the main relation,
preventing VACUUM from mangling up the toast relation (I moved
toast relation vacuuming into the lock time of the main table
recently).
But in the above case it is already vacuumed and not present
any more. Now the btree traversal needs to compare against a
value, long gone to the bit heaven, and that cannot work with
the toast architecture.
Seems the designs of btree and toast are colliding. As soon
as "text" is toastable, this'll hurt - be warned.
AFAICS, we need to always detoast values for index inserts
and have another toaster inside the index access methods
some day. In the meantime we should decide a safe maximum
index tuple size and emit an explanatory error message on the
attempt to insert oversized index entries instead of possibly
corrupting the index.
Comments?
Jan
At 14:02 11/07/00 +0200, Jan Wieck wrote:
AFAICS, we need to always detoast values for index inserts
and have another toaster inside the index access methods
some day.
We might not need it... at least not in the first pass.
In the meantime we should decide a safe maximum
index tuple size and emit an explanatory error message on the
attempt to insert oversized index entries instead of possibly
corrupting the index.
Can I suggest that we also put out a warning when defining an index using a
field with a (potentially) unlimited size? Indexing a text field will
mostly be a bizarre thing to do, but, eg, indexing the first 255 chars of a
text field (via substr) might not be.
----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/
Philip Warner wrote:
At 14:02 11/07/00 +0200, Jan Wieck wrote:
AFAICS, we need to always detoast values for index inserts
and have another toaster inside the index access methods
some day.
We might not need it... at least not in the first pass.
The thing is actually broken and needs a fix. As soon as
"text" is toastable, it can happen everywhere that text is
toasted even if its actual plain value would perfectly fit
into an index tuple. Think of a table with 20 text columns,
where the indexed one has a 1024-byte value, while all
others hold 512 bytes. In that case, the indexed one is the
biggest and gets toasted first. And if all the data is of
such a nature that compression doesn't gain enough, it might
still be the biggest one after that step and will be
considered for move-off ... boom.
We can't let this in in the first pass!
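The failure mode described here can be sketched as the selection strategy in miniature: compress attributes biggest-first, then move the biggest ones off until the tuple fits. This is a simplification for illustration, not the actual heap tuple toaster code:

```python
def toast_tuple(attrs, limit, compress):
    """Sketch of the toasting strategy described above. `attrs` maps
    attribute name -> size in bytes (names are illustrative); `compress`
    maps a size to its compressed size. Returns the moved-off attributes."""
    sizes = dict(attrs)
    compressed = set()
    moved_off = set()

    # Pass 1: compress attributes, biggest first, until the tuple fits.
    while sum(sizes.values()) > limit:
        candidates = [a for a in sizes if a not in compressed]
        if not candidates:
            break
        biggest = max(candidates, key=lambda a: sizes[a])
        sizes[biggest] = compress(sizes[biggest])
        compressed.add(biggest)

    # Pass 2: still too big -> move the biggest attributes off to the
    # toast table; only a small reference stays in the main tuple.
    REFERENCE_SIZE = 20   # illustrative size of an external reference
    while sum(sizes.values()) > limit:
        candidates = [a for a in sizes if a not in moved_off]
        if not candidates:
            break
        biggest = max(candidates, key=lambda a: sizes[a])
        sizes[biggest] = REFERENCE_SIZE
        moved_off.add(biggest)

    return moved_off

# A 20-column table: the indexed 1024-byte value is the biggest and gets
# moved off first, even though it would fit into an index tuple by itself.
attrs = {"indexed": 1024, **{f"col{i}": 512 for i in range(19)}}
moved = toast_tuple(attrs, limit=2000, compress=lambda size: size)
```

With incompressible data (`compress` is the identity here), the indexed column is exactly the one that ends up external.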
In the meantime we should decide a safe maximum
index tuple size and emit an explanatory error message on the
attempt to insert oversized index entries instead of possibly
corrupting the index.
Can I suggest that we also put out a warning when defining an index using a
field with a (potentially) unlimited size? Indexing a text field will
mostly be a bizarre thing to do, but, eg, indexing the first 255 chars of a
text field (via substr) might not be.
Marking it BOLD somewhere in the release notes, the CREATE
INDEX doc and some other places should be enough. Such a
message at every CREATE INDEX is annoying.
Jan
At 14:38 11/07/00 +0200, Jan Wieck wrote:
Can I suggest that we also put out a warning when defining an index using a
field with a (potentially) unlimited size? Indexing a text field will
mostly be a bizarre thing to do, but, eg, indexing the first 255 chars of a
text field (via substr) might not be.
Marking it BOLD somewhere in the release notes, the CREATE
INDEX doc and some other places should be enough. Such a
message at every CREATE INDEX is annoying.
The suggestion was only if the index contained a text, lztext etc field,
but no problem. The way I read your suggestion was that I'd get a real
error when doing an insert if the text was too large.
Philip Warner wrote:
At 14:38 11/07/00 +0200, Jan Wieck wrote:
Can I suggest that we also put out a warning when defining an index using a
field with a (potentially) unlimited size? Indexing a text field will
mostly be a bizarre thing to do, but, eg, indexing the first 255 chars of a
text field (via substr) might not be.
Marking it BOLD somewhere in the release notes, the CREATE
INDEX doc and some other places should be enough. Such a
message at every CREATE INDEX is annoying.
The suggestion was only if the index contained a text, lztext etc field,
but no problem. The way I read your suggestion was that I'd get a real
error when doing an insert if the text was too large.
Yes, that's what I'm after. It's too fragile IMHO to check on
multi-column indices with char(n) or so whether the resulting
index tuples will fit in the future.
The atttypmod field on NUMERIC columns for example doesn't
tell in an easy way how big the internal representation might
grow. And what about variable size user defined types that
are marked toastable? Can you estimate the maximum internal
storage size for them?
Jan
At 15:08 11/07/00 +0200, Jan Wieck wrote:
The atttypmod field on NUMERIC columns for example doesn't
tell in an easy way how big the internal representation might
grow. And what about variable size user defined types that
are marked toastable? Can you estimate the maximum internal
storage size for them?
Well, uncompressed size would be a good upper estimate, since you may be
passed already compressed data...
It appears that btree indices (at least) can keep references
to old toast values that survive a VACUUM! Seems these
references live in nodes not actually referring to a heap
tuple any more, but used during tree traversal in
comparisons. As if an index tuple delete from a btree does
not necessarily cause the index value to disappear from the
btree completely. It'll never be returned by an index scan,
but the value is still there somewhere.
Would it be possible to actually delete those entries during vacuum ?
I guess that would be an overall win, no ?
Andreas
Philip Warner wrote:
At 14:02 11/07/00 +0200, Jan Wieck wrote:
AFAICS, we need to always detoast values for index inserts
and have another toaster inside the index access methods
some day.
We might not need it... at least not in the first pass.
The thing is actually broken and needs a fix. As soon as
"text" is toastable, it can happen everywhere that text is
toasted even if its actual plain value would perfectly fit
into an index tuple. Think of a table with 20 text columns,
where the indexed one has a 1024-byte value, while all
others hold 512 bytes. In that case, the indexed one is the
biggest and gets toasted first. And if all the data is of
such a nature that compression doesn't gain enough, it might
still be the biggest one after that step and will be
considered for move-off ... boom.
We can't let this in in the first pass!
Have you added a minimum size for a value to actually be considered
for toasting? Imho some lower border between 64 and 256 bytes per value
would be useful, not only the row size.
Imho the same logic should apply to choose whether a fixed (max) size
column qualifies for toasting (e.g. varchar(32) should never qualify
for toasting).
Andreas