Applying TOAST to CURRENT
Hi,
now that we have the branch for 7.0, I could apply my actual
work on TOAST to the CURRENT development tree. Before doing
so, I'd like to discuss some related details.
1. In the actual version, the lztext datatype is stripped
down to something more similar to text (does not compress
on input). So it is kinda toastable base type for testing
purposes created at initdb time.
The pg_rules catalog still uses it, just that the toaster
is now responsible to do the compression work. No
problems so far with that.
In the long run I think lztext will disappear completely
again (it was supposed to be). Does anybody see a problem
with abuse of this type during development?
2. I've added another ALTER TABLE command to create the
external storage table for a relation. The syntax is
ALTER TABLE tablename CREATE TOAST TABLE;
Up to that, toastable types (lztext only yet) will be
compressed, but the INSERT still fails if compression
isn't enough to make a tuple fit.
We haven't decided yet how/when to create the secondary
relation and it's index. Since we intend to make base
types like text and varchar by default toastable, I don't
think that "if a tables schema contains toastable types"
is a good enough reason to create them silently. There
might exists tons of tables in a schema, that don't
require it.
OTOH I don't think it's a good thing to try creating
these things on the fly the first time needed. The
required catalog changes and file creations introduce all
kinds of possible rollback/crash problems, that we don't
want to have here - do we?
3. Tom, we don't have a consensus how to merge the TOAST
related function changes with the fmgr changes up to now.
Which base type specific functions will be touched due to
fmgr changes right now?
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
Hi,
now that we have the branch for 7.0, I could apply my actual
work on TOAST to the CURRENT development tree. Before doing
so, I'd like to discuss some related details.1. In the actual version, the lztext datatype is stripped
down to something more similar to text (does not compress
on input). So it is kinda toastable base type for testing
purposes created at initdb time.The pg_rules catalog still uses it, just that the toaster
is now responsible to do the compression work. No
problems so far with that.In the long run I think lztext will disappear completely
again (it was supposed to be). Does anybody see a problem
with abuse of this type during development?
Sounds fine.
2. I've added another ALTER TABLE command to create the
external storage table for a relation. The syntax isALTER TABLE tablename CREATE TOAST TABLE;
Up to that, toastable types (lztext only yet) will be
compressed, but the INSERT still fails if compression
isn't enough to make a tuple fit.We haven't decided yet how/when to create the secondary
relation and it's index. Since we intend to make base
types like text and varchar by default toastable, I don't
think that "if a tables schema contains toastable types"
is a good enough reason to create them silently. There
might exists tons of tables in a schema, that don't
require it.OTOH I don't think it's a good thing to try creating
these things on the fly the first time needed. The
required catalog changes and file creations introduce all
kinds of possible rollback/crash problems, that we don't
want to have here - do we?
Well, we could print the message suggesing ALTER TABLE when printing
tuple too large. Frankly, I don't see a problem in creating the backup
table automatically. If you are worried about performance, how about
putting it in a subdirectory.
--
Bruce Momjian | http://www.op.net/~candle
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote:
OTOH I don't think it's a good thing to try creating
these things on the fly the first time needed. The
required catalog changes and file creations introduce all
kinds of possible rollback/crash problems, that we don't
want to have here - do we?Well, we could print the message suggesing ALTER TABLE when printing
tuple too large. Frankly, I don't see a problem in creating the backup
table automatically. If you are worried about performance, how about
putting it in a subdirectory.
It's the toast-table and the index. So it's 2 Inodes and 16K
per table. If the backend is compiled with -g, someone needs
to create about 500 tables to waste the same amount of space.
Well, I like the subdirectory idea. I only wonder how that
should be implemented (actually the tablename is the filename
- and that doesn't allow / in it).
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
Bruce Momjian wrote:
OTOH I don't think it's a good thing to try creating
these things on the fly the first time needed. The
required catalog changes and file creations introduce all
kinds of possible rollback/crash problems, that we don't
want to have here - do we?Well, we could print the message suggesing ALTER TABLE when printing
tuple too large. Frankly, I don't see a problem in creating the backup
table automatically. If you are worried about performance, how about
putting it in a subdirectory.It's the toast-table and the index. So it's 2 Inodes and 16K
per table. If the backend is compiled with -g, someone needs
to create about 500 tables to waste the same amount of space.Well, I like the subdirectory idea. I only wonder how that
should be implemented (actually the tablename is the filename
- and that doesn't allow / in it).
Not sure. It will take some tricks, I am sure. How about if we add
some TOAST option to CREATE TABLE, so they can create with TOAST support
rather than having to use ALTER every time. Maybe that would work.
--
Bruce Momjian | http://www.op.net/~candle
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
JanWieck@t-online.de (Jan Wieck) writes:
3. Tom, we don't have a consensus how to merge the TOAST
related function changes with the fmgr changes up to now.
Which base type specific functions will be touched due to
fmgr changes right now?
For functions that need their inputs de-toasted, I think that the
changes you need should be a free byproduct of the fmgr changes.
I'd recommend we make those changes first, and then in a cleanup pass
you can modify anything that is able to work on still-toasted input.
I can't really do much with updating any varlena datatypes until
there's a version of heap_tuple_untoast_attr() somewhere in the
system --- if you look at src/include/fmgr.h, you'll see the call
is already there:
/* use this if you want the raw, possibly-toasted input datum: */
#define PG_GETARG_RAW_VARLENA_P(n) ((struct varlena *) PG_GETARG_POINTER(n))
/* use this if you want the input datum de-toasted: */
#define PG_GETARG_VARLENA_P(n) \
(VARATT_IS_EXTENDED(PG_GETARG_RAW_VARLENA_P(n)) ? \
(struct varlena *) heap_tuple_untoast_attr((varattrib *) PG_GETARG_RAW_VARLENA_P(n)) : \
PG_GETARG_RAW_VARLENA_P(n))
/* GETARG macros for varlena types will typically look like this: */
#define PG_GETARG_TEXT_P(n) ((text *) PG_GETARG_VARLENA_P(n))
BTW, it would save some casting if heap_tuple_untoast_attr were declared
to accept and return "struct varlena *" ...
Anyway, as soon as that code links to something that works, let me know
and I'll make a pass over the "text" functions. That should give you
something to test with.
regards, tom lane
JanWieck@t-online.de (Jan Wieck) writes:
OTOH I don't think it's a good thing to try creating
these things on the fly the first time needed. The
required catalog changes and file creations introduce all
kinds of possible rollback/crash problems, that we don't
want to have here - do we?
AFAIK we are pretty solid on rolling back table creation, it's just
rename/drop that have problems. A worse problem is what if two
backends both decide they need to create the toast table at the same
time. That might be fixable with appropriate locking but it seems
like there'd be potential for deadlocks.
Bruce Momjian wrote:
Well, we could print the message suggesing ALTER TABLE when printing
tuple too large. Frankly, I don't see a problem in creating the backup
table automatically. If you are worried about performance, how about
putting it in a subdirectory.
I agree with Bruce --- the toast table should be created automatically,
at least if the table contains any potentially-toastable columns. We
want this to be as transparent as possible. I'd rather have auto create
on-the-fly when first needed, but if that seems too risky then let's
just make the table when its owning table is created.
If you want to control it with an ALTER TABLE function, let's add ALTER
TABLE DROP TOAST so that admins who don't like the excess space usage
can get rid of it. (Of course that should only succeed after verifying
the toast table is empty...)
regards, tom lane
On Tue, 30 May 2000, Tom Lane wrote:
JanWieck@t-online.de (Jan Wieck) writes:
OTOH I don't think it's a good thing to try creating
these things on the fly the first time needed. The
required catalog changes and file creations introduce all
kinds of possible rollback/crash problems, that we don't
want to have here - do we?AFAIK we are pretty solid on rolling back table creation, it's just
rename/drop that have problems. A worse problem is what if two
backends both decide they need to create the toast table at the same
time. That might be fixable with appropriate locking but it seems
like there'd be potential for deadlocks.Bruce Momjian wrote:
Well, we could print the message suggesing ALTER TABLE when printing
tuple too large. Frankly, I don't see a problem in creating the backup
table automatically. If you are worried about performance, how about
putting it in a subdirectory.I agree with Bruce --- the toast table should be created automatically,
at least if the table contains any potentially-toastable columns. We
want this to be as transparent as possible. I'd rather have auto create
on-the-fly when first needed, but if that seems too risky then let's
just make the table when its owning table is created.
have to third this one ... I think it should be totally transparent to the
admin/user ... just create it when the table is created, what's the worst
case scenario? it never gets used and you waste 16k of disk space?
I agree with Bruce --- the toast table should be created automatically,
at least if the table contains any potentially-toastable columns. We
want this to be as transparent as possible. I'd rather have auto create
on-the-fly when first needed, but if that seems too risky then let's
just make the table when its owning table is created.If you want to control it with an ALTER TABLE function, let's add ALTER
TABLE DROP TOAST so that admins who don't like the excess space usage
can get rid of it. (Of course that should only succeed after verifying
the toast table is empty...)
But when you vacuum a table, doesn't it get zero size? Sure works here:
#$ cd /u/pg/data/base/test
#$ ls -l kkk*
-rw------- 1 postgres postgres 0 May 30 12:20 kkk
#$
--
Bruce Momjian | http://www.op.net/~candle
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
The Hermit Hacker wrote:
On Tue, 30 May 2000, Tom Lane wrote:
I agree with Bruce --- the toast table should be created automatically,
at least if the table contains any potentially-toastable columns. We
want this to be as transparent as possible. I'd rather have auto create
on-the-fly when first needed, but if that seems too risky then let's
just make the table when its owning table is created.have to third this one ... I think it should be totally transparent to the
admin/user ... just create it when the table is created, what's the worst
case scenario? it never gets used and you waste 16k of disk space?
You dont even use 16k if toast tables are like ordinary tables (which I
guess they are). New empty tables seem to occupy 0k.
So I'm also for immediate creation of tost tables for all tables that
require them, either at create (if there are any toastable columns in
the create clause) or at alter table time if first toestable column is
added after initial create.
The only drawback is bloating directories, but it was already suggested
that
TOAST tables could/should be kept in subdirectory toast (as should
indexes
too, imho).
And the most widespread database in the world does it too ;)
(dBASE and its derivates)
--------
Hannu
The Hermit Hacker wrote:
On Tue, 30 May 2000, Hannu Krosing wrote:
The only drawback is bloating directories, but it was already suggested
that
TOAST tables could/should be kept in subdirectory toast (as should
indexes
too, imho).still say, simplest "fix":
<dbname>/{system,db,toast,index}
Why can't we just add a column named "tablepath" to pg_table, that can
either be
a simple filename, or relative path with a filename or even full path
(if we don't worry too much for security ;)
That has came up before when discussing ways to make rename table
rollbackable
but it could be handy here two.
AFAIK it has been a general principle in programming to keep separate
things
separate unless a very good reason not to do so is present.
-----------
Hannu
Import Notes
Reference msg id not found: Pine.BSF.4.21.0005301813330.474-100000@thelab.hub.org | Resolved by subject fallback
On Tue, 30 May 2000, Hannu Krosing wrote:
The Hermit Hacker wrote:
On Tue, 30 May 2000, Tom Lane wrote:
I agree with Bruce --- the toast table should be created automatically,
at least if the table contains any potentially-toastable columns. We
want this to be as transparent as possible. I'd rather have auto create
on-the-fly when first needed, but if that seems too risky then let's
just make the table when its owning table is created.have to third this one ... I think it should be totally transparent to the
admin/user ... just create it when the table is created, what's the worst
case scenario? it never gets used and you waste 16k of disk space?You dont even use 16k if toast tables are like ordinary tables (which I
guess they are). New empty tables seem to occupy 0k.So I'm also for immediate creation of tost tables for all tables that
require them, either at create (if there are any toastable columns in
the create clause) or at alter table time if first toestable column is
added after initial create.The only drawback is bloating directories, but it was already suggested
that
TOAST tables could/should be kept in subdirectory toast (as should
indexes
too, imho).
still say, simplest "fix":
<dbname>/{system,db,toast,index}
The Hermit Hacker wrote:
have to third this one ... I think it should be totally transparent to the
admin/user ... just create it when the table is created, what's the worst
case scenario? it never gets used and you waste 16k of disk space?
Not exactly.
I've made some good experiences with having the toaster
trying to keep the main tuple size below 1/4 of MaxTupleSize
(BLKSIZE - block header). Remember that external stored
attributes are only fetched from the secondary relation if
really needed (when the result set is sent to the client or
if explicitly used in the query). So in a usual case, where a
relatively small amount of the entire data is retrieved and
key attributes are small, it's a win. With this config more
main tuples fit into one block, and if the attributes used in
the WHERE clause aren't stored external, the result set
(including sort and group actions) can be collected with
fewer block reads. Only those big values, that the client
really wanted, have to be fetched at send time.
If no external table exists, the toaster will try the <2K
thing by compression only. If the resulting tuple fits into
the 8K limit, it's OK. But if a secondary relation exists,
it'll store external to make the tuple <2K. Thus, a 4K or 6K
tuple, that actually fits and would be stored in the main
table, will cause the toaster to jump in if we allways create
the secondary table.
Hmmm - thinking about that it doesn't sound bad if we allways
create a secondary relation at CREATE TABLE time, but NOT the
index for it. And at VACUUM time we create the index if it
doesn't exist AND there is external stored data.
The table is prepared for external storage allways and we
avoid the risks from creating tables in possibly later
aborting transactions or due to concurrency issues. But we
don't waste the index space for really allways-small-tuple
tables.
Another benefit would be, that reloads should be faster
because with this technique, the toaster doesn't need to
insert index tuples during the load. The indices are created
later at VACUUM after reload.
The toaster needs to use sequential scans on the external
table until the next vacuum run, but index usage allways
depends on vacuum so that's not a real issue from my PoV.
At least a transparent compromise - isn't it?
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
JanWieck@t-online.de (Jan Wieck) writes:
Hmmm - thinking about that it doesn't sound bad if we allways
create a secondary relation at CREATE TABLE time, but NOT the
index for it. And at VACUUM time we create the index if it
doesn't exist AND there is external stored data.
Don't much like that --- what if the user doesn't run vacuum for
a good long while? Could be doing a lot of sequential scans over
a pretty large toast file...
If the 16K for an empty btree index really bothers you, let's
attack that head-on. I don't see why a freshly created index
couldn't be zero bytes, and the metadata page gets created on
first store into the index.
The toaster needs to use sequential scans on the external
table until the next vacuum run, but index usage allways
depends on vacuum so that's not a real issue from my PoV.
What makes you say that? Indexes will be used on a never-vacuumed
table with the current planner.
regards, tom lane
Jan Wieck wrote:
The Hermit Hacker wrote:
have to third this one ... I think it should be totally transparent to the
admin/user ... just create it when the table is created, what's the worst
case scenario? it never gets used and you waste 16k of disk space?Not exactly.
I've made some good experiences with having the toaster
trying to keep the main tuple size below 1/4 of MaxTupleSize
(BLKSIZE - block header).
Can't _that_ behaviour be made modifyable by some setting ?
Remember that external stored
attributes are only fetched from the secondary relation if
really needed (when the result set is sent to the client or
if explicitly used in the query). So in a usual case, where a
relatively small amount of the entire data is retrieved and
key attributes are small, it's a win. With this config more
main tuples fit into one block, and if the attributes used in
the WHERE clause aren't stored external, the result set
(including sort and group actions) can be collected with
fewer block reads. Only those big values, that the client
really wanted, have to be fetched at send time.
What is the priority of checks on indexed fetch?
I mean if we do "SELECT * FROM ttable WHERE toasted LIKE 'ab%' "
DO we first scan by index to 'ab%', then check if tuple is live and
after that to the LIKE comparison ?
Would it not be faster in toast case to use the already retrieved
index data and check that first, before going to main table (not to
mention the TOAST table)
If no external table exists, the toaster will try the <2K
thing by compression only. If the resulting tuple fits into
the 8K limit, it's OK.
Would it not be faster/cleaner to check some configuration variable
than the existance of toest table ?
But if a secondary relation exists,
it'll store external to make the tuple <2K. Thus, a 4K or 6K
tuple, that actually fits and would be stored in the main
table, will cause the toaster to jump in if we allways create
the secondary table.
Do our current (btree/hash) indexes support toast ?
If not, will they ?
Hmmm - thinking about that it doesn't sound bad if we allways
create a secondary relation at CREATE TABLE time, but NOT the
index for it. And at VACUUM time we create the index if it
doesn't exist AND there is external stored data.
Is there a plan to migrate to some combined index/database table for
at least toast tables later ?
For at least toast tables it seems feasible to start using the
originally planned tuple-spanning mechanisms, unless we plan
migrating LOs to toast table at some point which would make index-less
tuple chaining a bad idea as it would make seeking on really large
LOs slow.
The table is prepared for external storage allways and we
avoid the risks from creating tables in possibly later
aborting transactions or due to concurrency issues. But we
don't waste the index space for really allways-small-tuple
tables.
That could perhaps be done for other tables too, ie CREATE INDEX
would not actually create index until VACUUM notices that table is
big enough to make use of that index ?
On second thought that seems not a good idea to me ;(
Another benefit would be, that reloads should be faster
because with this technique, the toaster doesn't need to
insert index tuples during the load. The indices are created
later at VACUUM after reload.
AFAIK reloads (from pg_dump at least) create indexes after LOAD'ing data
The toaster needs to use sequential scans on the external
table until the next vacuum run, but index usage allways
depends on vacuum so that's not a real issue from my PoV.At least a transparent compromise - isn't it?
But do we need it ?
I suspect there are other issues that need your attention more than
complicating table creation to save a few kb ;)
Creating toast tables still wastes only 1MB per 64 tables _that have
toastable columns_, which seems real cheap considering today's HD
prices.
You would need 6400 toast tables to consume 1% of the smallest currently
available (10GB) disk.
If that is a concern this can probably be cured by good docs that say
in detail which datatypes cause toast tables an which don't.
-----------
Hannu
Hmmm - thinking about that it doesn't sound bad if we allways
create a secondary relation at CREATE TABLE time, but NOT the
index for it. And at VACUUM time we create the index if it
doesn't exist AND there is external stored data.
Seems we are trying to reduce the dependency on vacuum in other
areas (e.g. overwriting smgr).
I would prefer explicit syntax to enable toast (create table and alter
table).
Andreas
Import Notes
Resolved by subject fallback
Hannu Krosing wrote:
I've made some good experiences with having the toaster
trying to keep the main tuple size below 1/4 of MaxTupleSize
(BLKSIZE - block header).Can't _that_ behaviour be made modifyable by some setting ?
Good point.
There is already a fine tuning option per table attribute,
where someone can tell things like "forget about compression
for this attribute" or "try keeping in main tuple and toast
others first". Theres no utility command up to now to
customize them, but an UPDATE pg_attribute does it already.
Seems another value in pg_class, telling the toaster what max
size to try, would be a good idea.
What is the priority of checks on indexed fetch?
I mean if we do "SELECT * FROM ttable WHERE toasted LIKE 'ab%' "
DO we first scan by index to 'ab%', then check if tuple is live and
after that to the LIKE comparison ?
That's the current behaviour, and TOAST doesn't change it.
There was discussion already about index tuple toasting.
Indices have different size constraints and other features so
they cannot share exactly the same toasting scheme as heap
tuples.
I'm still not sure if supporting indices on huge values is
worth the efford. Many databases have some limit on the size
of index entries, and noone seems to really care for that.
If no external table exists, the toaster will try the <2K
thing by compression only. If the resulting tuple fits into
the 8K limit, it's OK.Would it not be faster/cleaner to check some configuration variable
than the existance of toest table ?
The toast tables and indexes OID are stored in pg_class. An
open Relation has reference to the pg_class row, so it's
simply comparing that to INVALID_OID. No wasted time here.
Do our current (btree/hash) indexes support toast ?
Not hard tested yet. At least, they don't support it if
toasting would be required to make the index tuple fit, but
the heap toaster is already happy with it.
The tuple is modified in place at heap_insert(). So the later
index_insert() will use the Datums found there to build the
index tuples, either plain or toast reference, whatever the
toaster left.
If not, will they ?
Not planned for 7.1. Maybe we can workout a solution for
unlimited index entries after that.
Hmmm - thinking about that it doesn't sound bad if we allways
create a secondary relation at CREATE TABLE time, but NOT the
index for it. And at VACUUM time we create the index if it
doesn't exist AND there is external stored data.Is there a plan to migrate to some combined index/database table for
at least toast tables later ?
No. But we plan a general overwriting storage manager, so
that might not be an issue at all.
For at least toast tables it seems feasible to start using the
originally planned tuple-spanning mechanisms, unless we plan
migrating LOs to toast table at some point which would make index-less
tuple chaining a bad idea as it would make seeking on really large
LOs slow.
I've never seen a complete proposal for tuple-spanning. The
toaster breaks up the large Datum into chunks. There is a
chunk number, so modifying the index to be a multi-attribute
one would gain direct access to a chunk. That should make
seeks reasonably fast.
Another benefit would be, that reloads should be faster
because with this technique, the toaster doesn't need to
insert index tuples during the load. The indices are created
later at VACUUM after reload.AFAIK reloads (from pg_dump at least) create indexes after LOAD'ing data
Finally the toast table will have another relkind, so it'll
not be accessible by normal SQL. The toaster acts on these
tables more hardwired like on system catalogs. It expects a
fixed schema and uses direct heap access. Due to the
different relkind, a dump wouldn't be able to delay the index
creation.
But do we need it ?
[...]
You would need 6400 toast tables to consume 1% of the smallest currently
available (10GB) disk.If that is a concern this can probably be cured by good docs that say
in detail which datatypes cause toast tables an which don't.
We plan to make ALL variable size builtin types toastable. So
this list would name them all :-).
But this 6400 = 1% really is the point. Let's forget about
the 16K and create the toast table allways (as soon as the
main table has toastable attributes).
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #