Column storage positions

Started by Phil Currierabout 19 years ago60 messageshackers
Jump to latest
#1Phil Currier
pcurrier@gmail.com

Inspired by this thread [1]http://archives.postgresql.org/pgsql-hackers/2006-12/msg00780.php, and in particular by the idea of storing
three numbers (permanent ID, on-disk storage position, display
position) for each column, I spent a little time messing around with a
prototype implementation of column storage positions to see what kind
of difference it would make. The results were encouraging: on a table
with 20 columns of alternating smallint and varchar(10) datatypes,
selecting the max() of one of the rightmost int columns across 1
million rows ran around 3 times faster. The same query on the
leftmost varchar column (which should suffer the most from this
change) predictably got a little slower (about 10%); I couldn't
measure a performance drop on the rightmost varchar columns. The
table's size didn't drop much in this case, but a different table of
20 alternating int and smallint columns showed a 20% slimmer disk
footprint, pretty much as expected. Pgbenching showed no measurable
difference, which isn't surprising since the pgbench test tables
consist of just int values with char filler at the end.

So here is a proposal for separating a column's storage position from
its permanent ID. I've ignored the display position piece of the
original thread because display positions don't do much other than
save you the hassle of creating a view on top of your table, while
storage positions have demonstrable, tangible benefits. And there is
no reason to connect the two features; display positions can easily be
added separately at a later point.

We want to decouple a column's on-disk storage position from its
permanent ID for two reasons: to minimize the space lost to alignment
padding between fields, and to speed up access to individual fields.
The system will automatically assign new storage positions when a
table is created, and when a table alteration requires a rewrite
(currently just adding a column with a default, or changing a column
datatype). To allow users to optimize tables based on the fields they
know will be frequently accessed, I think we should extend ALTER TABLE
to accept user-assigned storage positions (something like "ALTER TABLE
ALTER col SET STORAGE POSITION X"). This command would also be useful
for another reason discussed below.

In my prototype, I used these rules to determine columns' storage order:
1) fixed-width fields before variable-width, dropped columns always last
2) fixed-width fields ordered by increasing size
3) not-null fields before nullable fields
There are other approaches worth considering - for example, you could
imagine swapping the priority of rules 2 and 3. Resultant tables
would generally have more alignment waste, but would tend to have
slightly faster field access. I'm really not sure what the optimal
strategy is since every user will have a slightly different metric for
"optimal". In any event, either of these approaches is better than
the current situation.

To implement this, we'll need a field (perhaps attstoragepos?) in
pg_attribute to hold the storage position. It will equal attnum until
it is explicitly reassigned. The routines in heaptuple.c need to
quickly loop through the fields of a tuple in storage order rather
than attnum order, so I propose extending TupleDesc to hold an
"attrspos" array that sits alongside the attrs array. In the
prototype I used an array of int2 indices into the attrs array,
ordered by storage position.

These changes cause a problem in ExecTypeFromTLInternal: this function
calls CreateTemplateTupleDesc followed by TupleDescInitEntry, assuming
that attnum == attstoragepos for all tuples. With the introduction of
storage positions, this of course will no longer be true. I got
around this by having expand_targetlist, build_physical_tlist, and
build_relation_tlist make sure each TargetEntry (for targetlists
corresponding to either insert/update tuples, or base tuples pulled
straight from the heap) gets a correct resorigtbl and resname. Then
ExecTypeFromTLInternal first tries calling a new function
TupleDescInitEntryAttr, which hands off to TupleDescInitEntry and then
performs a syscache lookup to update the storage position using the
resorigtbl. This is a little ugly because ExecTypeFromTLInternal
doesn't know in advance what kind of tupledesc it's building, so it
needs to retreat to the old method whenever the syscache lookup fails,
but it was enough to pass the regression tests. I could use some
advice on this - there's probably a better way to do it.

Another problem relates to upgrades. With tools like pg_migrator now
on pgfoundry, people will eventually expect quick upgrades that don't
require rewriting each table's data. Storage positions would cause a
problem for every version X -> version Y upgrade with Y >= 8.3, even
when X is also >= 8.3, because a version X table could always have
been altered without a rewrite into a structure different from what
Y's CREATE TABLE will choose. I don't think it's as simple as just
using the above-mentioned ALTER TABLE extension to assign the proper
storage positions for each field, because the version X table could
have dropped columns that might or might not be present in any given
tuple on disk. The best solution I can see is having pg_dump create a
table covering *all* columns (including dropped ones) with explicit
storage positions, and then immediately issue an alter statement to
get rid of the dropped columns. I'm not thrilled about this approach
(in particular, preserving dropped columns across upgrades seems
sloppy), but I haven't been able to think of anything better.
Hopefully I'm missing a simpler way to do this.

Comments and ideas? Does this whole thing seem worthwhile to do?

phil

[1]: http://archives.postgresql.org/pgsql-hackers/2006-12/msg00780.php

#2Sergey E. Koposov
math@sai.msu.ru
In reply to: Phil Currier (#1)
Re: Column storage positions

Just as my 2 cents to the proposed idea.
I want to demonstrate that the proposed idea is very relevant for the
performance.

I recently did an migration from PG 8.1 to PG 8.2. During that time I was
dumping the 2TB database with several very wide tables (having ~ 200
columns). And I saw that on my pretty powerful server with 8Gb
RAM, Itanium2 procesor,large RAID which can do I/O at 100Mb/sec the
performance of pg_dump was CPU limited, and the read speed of the tables
was 1-1.5mb/sec (leading to 2 week dumping time).

I was very surprised by these times, and profiled postgres to check the
reason of that:
here is the top of gprof:
% cumulative self self total
time seconds seconds calls s/call s/call name
60.72 13.52 13.52 6769826 0.00 0.00 nocachegetattr
10.58 15.88 2.36 9035566 0.00 0.00 CopyAttributeOutText
7.22 17.49 1.61 65009457 0.00 0.00 CopySendData
6.34 18.90 1.41 1 1.41 22.21 CopyTo

So the main slow-down of the process was all this code recomputing the
boundaries of the columns.... I checked that by removing one tiny varchar
column and COALESCING all NULLs, and after that the performance of
pg_dumping increased by more than a factor of 2!

I should have reported that experience earlier... but I hope that my
observations can be useful in the context of the Phil's idea.

regards,
Sergey

*******************************************************************
Sergey E. Koposov
Max Planck Institute for Astronomy/Cambridge Institute for Astronomy/Sternberg Astronomical Institute
Tel: +49-6221-528-349
Web: http://lnfm1.sai.msu.ru/~math
E-mail: math@sai.msu.ru

#3Robert Treat
xzilla@users.sourceforge.net
In reply to: Phil Currier (#1)
Re: Column storage positions

On Tuesday 20 February 2007 16:07, Phil Currier wrote:

Another problem relates to upgrades. With tools like pg_migrator now
on pgfoundry, people will eventually expect quick upgrades that don't
require rewriting each table's data. Storage positions would cause a
problem for every version X -> version Y upgrade with Y >= 8.3, even
when X is also >= 8.3, because a version X table could always have
been altered without a rewrite into a structure different from what
Y's CREATE TABLE will choose.

If you are using pg_migrator your not going to be moving the datafiles on disk
anyway,so pg_migrator's behavior shouldnt change terribly. If your doing
pg_dump based upgrade, presumably pg_dump could write it's create statements
with the columns in attstorpos order and set attnum = attstorpos, preserving
the physical layout from the previous install.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Phil Currier (#1)
Re: Column storage positions

Phil Currier escribi�:

Inspired by this thread [1], and in particular by the idea of storing
three numbers (permanent ID, on-disk storage position, display
position) for each column, I spent a little time messing around with a
prototype implementation of column storage positions to see what kind
of difference it would make. The results were encouraging: on a table
with 20 columns of alternating smallint and varchar(10) datatypes,
selecting the max() of one of the rightmost int columns across 1
million rows ran around 3 times faster.

[snipped]

I'd expect the system being able to reoder the columns to the most
efficient order possible (performance-wise and padding-saving-wise),
automatically. When you create a table, sort the columns to the most
efficient order; ALTER TABLE ADD COLUMN just puts the new columns at the
end of the tuple; and anything that requires a rewrite of the table
(ALTER TABLE ... ALTER TYPE for example; would be cool to have CLUSTER
do it as well; and do it on TRUNCATE also) again recomputes the most
efficient order.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#5Phil Currier
pcurrier@gmail.com
In reply to: Alvaro Herrera (#4)
Re: Column storage positions

On 2/21/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:

I'd expect the system being able to reoder the columns to the most
efficient order possible (performance-wise and padding-saving-wise),
automatically. When you create a table, sort the columns to the most
efficient order; ALTER TABLE ADD COLUMN just puts the new columns at the
end of the tuple; and anything that requires a rewrite of the table
(ALTER TABLE ... ALTER TYPE for example; would be cool to have CLUSTER
do it as well; and do it on TRUNCATE also) again recomputes the most
efficient order.

That's exactly what I'm proposing. On table creation, the system
chooses an efficient column order for you. The next time an ALTER
TABLE operation forces a rewrite, the system would recompute the
column storage order. I hadn't thought of having CLUSTER also redo
the storage order, but that seems safe since it takes an exclusive
lock on the table. I'm less sure about whether it's safe to do this
during a TRUNCATE.

phil

#6Bruce Momjian
bruce@momjian.us
In reply to: Phil Currier (#5)
Re: Column storage positions

Phil Currier wrote:

On 2/21/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:

I'd expect the system being able to reoder the columns to the most
efficient order possible (performance-wise and padding-saving-wise),
automatically. When you create a table, sort the columns to the most
efficient order; ALTER TABLE ADD COLUMN just puts the new columns at the
end of the tuple; and anything that requires a rewrite of the table
(ALTER TABLE ... ALTER TYPE for example; would be cool to have CLUSTER
do it as well; and do it on TRUNCATE also) again recomputes the most
efficient order.

That's exactly what I'm proposing. On table creation, the system
chooses an efficient column order for you. The next time an ALTER
TABLE operation forces a rewrite, the system would recompute the
column storage order. I hadn't thought of having CLUSTER also redo
the storage order, but that seems safe since it takes an exclusive
lock on the table. I'm less sure about whether it's safe to do this
during a TRUNCATE.

Keep in mind we have a patch in process to reduce the varlena length and
reduce alignment requirements, so once that is in, reordering columns
will not be as important.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#7Florian Pflug
fgp@phlo.org
In reply to: Phil Currier (#5)
Re: Column storage positions

Phil Currier wrote:

On 2/21/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:

I'd expect the system being able to reoder the columns to the most
efficient order possible (performance-wise and padding-saving-wise),
automatically. When you create a table, sort the columns to the most
efficient order; ALTER TABLE ADD COLUMN just puts the new columns at the
end of the tuple; and anything that requires a rewrite of the table
(ALTER TABLE ... ALTER TYPE for example; would be cool to have CLUSTER
do it as well; and do it on TRUNCATE also) again recomputes the most
efficient order.

That's exactly what I'm proposing. On table creation, the system
chooses an efficient column order for you. The next time an ALTER
TABLE operation forces a rewrite, the system would recompute the
column storage order. I hadn't thought of having CLUSTER also redo
the storage order, but that seems safe since it takes an exclusive
lock on the table. I'm less sure about whether it's safe to do this
during a TRUNCATE.

I think you'd want to have a flag per field that tell you if the user
has overridden the storage pos for that specific field. Otherwise,
the next time you have to chance to optimize the ordering, you might
throw away changes that the admin has done on purpose. The same hold
true for a pg_dump/pg_reload cycle. If none of the fields had their
storage order changed manually, you'd want to reoder them optimally
at dump/reload time. If, however, the admin specified an ordering, you'd
want to preserve that.

greetings, Florian Pflug

#8Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Bruce Momjian (#6)
Re: Column storage positions

Bruce Momjian escribi�:

Phil Currier wrote:

On 2/21/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:

I'd expect the system being able to reoder the columns to the most
efficient order possible (performance-wise and padding-saving-wise),
automatically. When you create a table, sort the columns to the most
efficient order; ALTER TABLE ADD COLUMN just puts the new columns at the
end of the tuple; and anything that requires a rewrite of the table
(ALTER TABLE ... ALTER TYPE for example; would be cool to have CLUSTER
do it as well; and do it on TRUNCATE also) again recomputes the most
efficient order.

That's exactly what I'm proposing. On table creation, the system
chooses an efficient column order for you. The next time an ALTER
TABLE operation forces a rewrite, the system would recompute the
column storage order. I hadn't thought of having CLUSTER also redo
the storage order, but that seems safe since it takes an exclusive
lock on the table. I'm less sure about whether it's safe to do this
during a TRUNCATE.

Keep in mind we have a patch in process to reduce the varlena length and
reduce alignment requirements, so once that is in, reordering columns
will not be as important.

Yes, but the "cache offset" stuff is still significant, so there will be
some benefit in putting all the fixed-length attributes at the start of
the tuple, and varlena atts grouped at the end.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#9Phil Currier
pcurrier@gmail.com
In reply to: Bruce Momjian (#6)
Re: Column storage positions

On 2/21/07, Bruce Momjian <bruce@momjian.us> wrote:

Keep in mind we have a patch in process to reduce the varlena length and
reduce alignment requirements, so once that is in, reordering columns
will not be as important.

Well, as I understand it, that patch isn't really addressing the same
problem. Consider this table:
create table foo (a varchar(10), b int, c smallint, d int, e smallint, ....);

There are two problems here:

1) On my machine, each int/smallint column pair takes up 8 bytes. 2
of those 8 bytes are alignment padding wasted on the smallint field.
If we grouped all the smallint fields together within the tuple, that
space would not be lost.

2) Each time you access any of the int/smallint fields, you have to
peek inside the varchar field to figure out its length. If we stored
the varchar field at the end of the tuple instead, the access times
for all the other fields would be measurably improved, by a factor
that greatly outweighs the small penalty imposed on the varchar field
itself.

My understanding is that the varlena headers patch would potentially
reduce the size of the varchar header (which is definitely worthwhile
by itself), but it wouldn't help much for either of these problems.
Or am I misunderstanding what that patch does?

phil

#10Bruce Momjian
bruce@momjian.us
In reply to: Alvaro Herrera (#8)
Re: Column storage positions

Alvaro Herrera wrote:

Bruce Momjian escribi?:

Phil Currier wrote:

On 2/21/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:

I'd expect the system being able to reoder the columns to the most
efficient order possible (performance-wise and padding-saving-wise),
automatically. When you create a table, sort the columns to the most
efficient order; ALTER TABLE ADD COLUMN just puts the new columns at the
end of the tuple; and anything that requires a rewrite of the table
(ALTER TABLE ... ALTER TYPE for example; would be cool to have CLUSTER
do it as well; and do it on TRUNCATE also) again recomputes the most
efficient order.

That's exactly what I'm proposing. On table creation, the system
chooses an efficient column order for you. The next time an ALTER
TABLE operation forces a rewrite, the system would recompute the
column storage order. I hadn't thought of having CLUSTER also redo
the storage order, but that seems safe since it takes an exclusive
lock on the table. I'm less sure about whether it's safe to do this
during a TRUNCATE.

Keep in mind we have a patch in process to reduce the varlena length and
reduce alignment requirements, so once that is in, reordering columns
will not be as important.

Yes, but the "cache offset" stuff is still significant, so there will be
some benefit in putting all the fixed-length attributes at the start of
the tuple, and varlena atts grouped at the end.

Agreed.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#11Bruce Momjian
bruce@momjian.us
In reply to: Phil Currier (#9)
Re: Column storage positions

Phil Currier wrote:

On 2/21/07, Bruce Momjian <bruce@momjian.us> wrote:

Keep in mind we have a patch in process to reduce the varlena length and
reduce alignment requirements, so once that is in, reordering columns
will not be as important.

Well, as I understand it, that patch isn't really addressing the same
problem. Consider this table:
create table foo (a varchar(10), b int, c smallint, d int, e smallint, ....);

There are two problems here:

1) On my machine, each int/smallint column pair takes up 8 bytes. 2
of those 8 bytes are alignment padding wasted on the smallint field.
If we grouped all the smallint fields together within the tuple, that
space would not be lost.

Yes, good point.

2) Each time you access any of the int/smallint fields, you have to
peek inside the varchar field to figure out its length. If we stored
the varchar field at the end of the tuple instead, the access times
for all the other fields would be measurably improved, by a factor
that greatly outweighs the small penalty imposed on the varchar field
itself.

My understanding is that the varlena headers patch would potentially
reduce the size of the varchar header (which is definitely worthwhile
by itself), but it wouldn't help much for either of these problems.
Or am I misunderstanding what that patch does?

Agreed.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#12Martijn van Oosterhout
kleptog@svana.org
In reply to: Florian Pflug (#7)
Re: Column storage positions

On Wed, Feb 21, 2007 at 03:59:12PM +0100, Florian G. Pflug wrote:

I think you'd want to have a flag per field that tell you if the user
has overridden the storage pos for that specific field. Otherwise,
the next time you have to chance to optimize the ordering, you might
throw away changes that the admin has done on purpose.

Why would you want to let the admin have any say at all about the
storage order?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

From each according to his ability. To each according to his ability to litigate.

#13Andrew Dunstan
andrew@dunslane.net
In reply to: Florian Pflug (#7)
Re: Column storage positions

Florian G. Pflug wrote:

I think you'd want to have a flag per field that tell you if the user
has overridden the storage pos for that specific field. Otherwise,
the next time you have to chance to optimize the ordering, you might
throw away changes that the admin has done on purpose. The same hold
true for a pg_dump/pg_reload cycle. If none of the fields had their
storage order changed manually, you'd want to reoder them optimally
at dump/reload time. If, however, the admin specified an ordering, you'd
want to preserve that.

I don't think users should be monkeying with the storage position at
all. Decisions about that should belong to the engine, not to users.
Providing a user tweakable knob for this strikes me as a large footgun,
as well as requiring all sorts of extra checks along the lines you are
talking of.

cheers

andrew

#14Phil Currier
pcurrier@gmail.com
In reply to: Martijn van Oosterhout (#12)
Re: Column storage positions

On 2/21/07, Martijn van Oosterhout <kleptog@svana.org> wrote:

On Wed, Feb 21, 2007 at 03:59:12PM +0100, Florian G. Pflug wrote:

I think you'd want to have a flag per field that tell you if the user
has overridden the storage pos for that specific field. Otherwise,
the next time you have to chance to optimize the ordering, you might
throw away changes that the admin has done on purpose.

Why would you want to let the admin have any say at all about the
storage order?

Well, for two reasons:

1) If you have a table with one very-frequently-accessed varchar()
column and several not-frequently-accessed int columns, it might
actually make sense to put the varchar column first. The system won't
always be able to make the most intelligent decision about table
layout.

2) As I described in my original email, without this capability, I
don't see any good way to perform an upgrade between PG versions
without rewriting each table's data. Maybe most people aren't doing
upgrades like this right now, but it seems like it will only become
more common in the future. In my opinion, this is more important than
#1.

But I understand that it's a potential foot-gun, so I'm happy to drop
it. It would be nice though if there were some ideas about how to
address problem #2 at least.

phil

#15Florian Pflug
fgp@phlo.org
In reply to: Martijn van Oosterhout (#12)
Re: Column storage positions

Martijn van Oosterhout wrote:

On Wed, Feb 21, 2007 at 03:59:12PM +0100, Florian G. Pflug wrote:

I think you'd want to have a flag per field that tell you if the user
has overridden the storage pos for that specific field. Otherwise,
the next time you have to chance to optimize the ordering, you might
throw away changes that the admin has done on purpose.

Why would you want to let the admin have any say at all about the
storage order?

It wasn't my idea - the OP proposed a "alter table <table> alter column
<col> set storage position <pos>" command. But if you're gonna decouple
the storage order from the attnum, they why don't let the dba tweak it?

Since you have at least two possible optimization speeds - for size, or
for fast access to specifc fields, creating a one-size-fits-all ordering
rule seems hard...

greetings, Florian Pflug

#16Martijn van Oosterhout
kleptog@svana.org
In reply to: Phil Currier (#14)
Re: Column storage positions

On Wed, Feb 21, 2007 at 12:06:30PM -0500, Phil Currier wrote:

Well, for two reasons:

1) If you have a table with one very-frequently-accessed varchar()
column and several not-frequently-accessed int columns, it might
actually make sense to put the varchar column first. The system won't
always be able to make the most intelligent decision about table
layout.

Umm, the point of the exercise is that if you know there are int
columns, then you can skip over them, whereas you can never skip over a
varchar column. So there isn't really any situation where it would be
better to put the varchar first.

don't see any good way to perform an upgrade between PG versions
without rewriting each table's data. Maybe most people aren't doing
upgrades like this right now, but it seems like it will only become
more common in the future. In my opinion, this is more important than
#1.

I don't see this either. For all current tables, the storage position
is the attribute number, no exception. You say:

because the version X table could
have dropped columns that might or might not be present in any given
tuple on disk.

Whether they're there or not is irrelevent. Drop columns are not
necesarily empty, but in any case they occupy a storage position until
the table is rewritten. A dump/restore doesn't need to preserve this,
but pg_migrator will need some smarts to handle it. The system will
need to create a column of the appropriate type and drop it to get to
the right state.

If you really want to use pg_dump I'd suggest an option to pg_dump
--dump-dropped-columns which will include the dropped columns in the
CREATE TABLE but drop them immediatly after. It's really more a corner
case than anything else.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

From each according to his ability. To each according to his ability to litigate.

#17Florian Pflug
fgp@phlo.org
In reply to: Andrew Dunstan (#13)
Re: Column storage positions

Andrew Dunstan wrote:

Florian G. Pflug wrote:

I think you'd want to have a flag per field that tell you if the user
has overridden the storage pos for that specific field. Otherwise,
the next time you have to chance to optimize the ordering, you might
throw away changes that the admin has done on purpose. The same hold
true for a pg_dump/pg_reload cycle. If none of the fields had their
storage order changed manually, you'd want to reoder them optimally
at dump/reload time. If, however, the admin specified an ordering, you'd
want to preserve that.

I don't think users should be monkeying with the storage position at
all. Decisions about that should belong to the engine, not to users.
Providing a user tweakable knob for this strikes me as a large footgun,
as well as requiring all sorts of extra checks along the lines you are
talking of.

Maybe you shouldn't support specifying the storage order directly, but
rather through some kind of "priority field". The idea would be that
the storage order is determinted by sorting the fields according to
the priority field. Groups of fields with the same priority would
get ordered for maximal space efficiency.

greetings, Florian Pflug

#18Simon Riggs
simon@2ndQuadrant.com
In reply to: Phil Currier (#5)
Re: Column storage positions

On Wed, 2007-02-21 at 09:25 -0500, Phil Currier wrote:

On 2/21/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:

I'd expect the system being able to reoder the columns to the most
efficient order possible (performance-wise and padding-saving-wise),
automatically. When you create a table, sort the columns to the most
efficient order; ALTER TABLE ADD COLUMN just puts the new columns at the
end of the tuple; and anything that requires a rewrite of the table
(ALTER TABLE ... ALTER TYPE for example; would be cool to have CLUSTER
do it as well; and do it on TRUNCATE also) again recomputes the most
efficient order.

That's exactly what I'm proposing. On table creation, the system
chooses an efficient column order for you.

That's fairly straightforward and beneficial. I much prefer Alvaro's
approach rather than the storage position details originally described.
Moreover, you'd need to significantly re-write lots of ALTER TABLE and I
really don't think you want to go there.

There is a problem: If people do a CREATE TABLE and then issue SELECT *
they will find the columns in a different order. That could actually
break some programs, so it isn't acceptable in all cases. e.g. COPY
without a column-list assumes that the incoming data should be assigned
to the table columns in the same order as the incoming data file.

So if we do this, it should be controllable using a GUC:
optimize_column_order = off (default) | on
This should be a USERSET, so different users can create tables in either
full control or optimised mode, as they choose.

It should be possible to do that with the minimum number of position
swaps, so that people who have ordered the columns according to usage
frequency would still get what they wanted.

The next time an ALTER
TABLE operation forces a rewrite, the system would recompute the
column storage order. I hadn't thought of having CLUSTER also redo
the storage order, but that seems safe since it takes an exclusive
lock on the table. I'm less sure about whether it's safe to do this
during a TRUNCATE.

The GUC should apply to whenever/wherever this optimization occurs.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#19Phil Currier
pcurrier@gmail.com
In reply to: Martijn van Oosterhout (#16)
Re: Column storage positions

On 2/21/07, Martijn van Oosterhout <kleptog@svana.org> wrote:

don't see any good way to perform an upgrade between PG versions
without rewriting each table's data. Maybe most people aren't doing
upgrades like this right now, but it seems like it will only become
more common in the future. In my opinion, this is more important than
#1.

I don't see this either. For all current tables, the storage position
is the attribute number, no exception. You say:

because the version X table could
have dropped columns that might or might not be present in any given
tuple on disk.

Whether they're there or not is irrelevent. Drop columns are not
necesarily empty, but in any case they occupy a storage position until
the table is rewritten. A dump/restore doesn't need to preserve this,
but pg_migrator will need some smarts to handle it. The system will
need to create a column of the appropriate type and drop it to get to
the right state.

I agree, a dump/restore that rewrites all the table datafiles doesn't
need to handle this. And I agree that the system will need to create
dropped columns and then drop them again, that's exactly what I
suggested in fact. We're talking about pg_migrator-style upgrades
only here.

Say we do this in 8.2:

create table foo (a varchar(10), b int);
insert into foo ....
alter table foo add column c int;

At this point, the column storage order is (a, b, c) because 8.2 never
changes storage order. Then you upgrade to 8.3. pg_dump now wants to
write out some DDL that will create a table matching the existing
table datafile, since we don't want to have to rewrite it. pg_dump
prints out:

create table foo (a varchar(10), b int, c int);

The 8.3 system will try to create the table with column order (b, c,
a), since it's trying to optimize storage order, and that won't match
the existing table datafile. What we need is a way to make sure that
the table matches the original datafile.

Now say that it's not an 8.2 -> 8.3 upgrade, say it's an 8.3 -> 8.4
upgrade. In this case, 8.3 would have the table with storage order
(b, a, c). (Column c would have been added at the end since it was
added without a default, and didn't force a table rewrite.) How do
you get pg_dump to print out table creation DDL that will result in a
table matching the existing (b, a, c) table datafile?

This is why I think pg_dump needs to be able to print an ALTER TABLE
statement that will explicitly assign storage positions. This happens
to have the side-effect of being potentially useful to admins who
might want control over that.

If this only affected 8.2 -> 8.3 upgrades, then maybe it's not as
important an issue. But I think it affects *all* future upgrades,
which is why I'm trying to raise the issue now.

Show quoted text

If you really want to use pg_dump I'd suggest an option to pg_dump
--dump-dropped-columns which will include the dropped columns in the
CREATE TABLE but drop them immediatly after. It's really more a corner
case than anything else.

#20Andrew Dunstan
andrew@dunslane.net
In reply to: Simon Riggs (#18)
Re: Column storage positions

Simon Riggs wrote:

On Wed, 2007-02-21 at 09:25 -0500, Phil Currier wrote:

On 2/21/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:

I'd expect the system being able to reoder the columns to the most
efficient order possible (performance-wise and padding-saving-wise),
automatically. When you create a table, sort the columns to the most
efficient order; ALTER TABLE ADD COLUMN just puts the new columns at the
end of the tuple; and anything that requires a rewrite of the table
(ALTER TABLE ... ALTER TYPE for example; would be cool to have CLUSTER
do it as well; and do it on TRUNCATE also) again recomputes the most
efficient order.

That's exactly what I'm proposing. On table creation, the system
chooses an efficient column order for you.

That's fairly straightforward and beneficial. I much prefer Alvaro's
approach rather than the storage position details originally described.
Moreover, you'd need to significantly re-write lots of ALTER TABLE and I
really don't think you want to go there.

There is a problem: If people do a CREATE TABLE and then issue SELECT *
they will find the columns in a different order. That could actually
break some programs, so it isn't acceptable in all cases. e.g. COPY
without a column-list assumes that the incoming data should be assigned
to the table columns in the same order as the incoming data file.

You seem to have missed that we will be separating logical from physical
ordering. Each attribute will have a permanent id, a physical ordering
and a logical ordering. You can change either ordering without affecting
the other.

COPY, SELECT and all user-visible commands should follow the logical
ordering, not the physical ordering, which should be completely
invisible to SQL.

cheers

andrew

#21Simon Riggs
simon@2ndQuadrant.com
In reply to: Andrew Dunstan (#20)
#22Stephan Szabo
sszabo@megazone23.bigpanda.com
In reply to: Martijn van Oosterhout (#16)
#23Florian Pflug
fgp@phlo.org
In reply to: Phil Currier (#19)
#24Florian Pflug
fgp@phlo.org
In reply to: Simon Riggs (#18)
#25Andrew Dunstan
andrew@dunslane.net
In reply to: Simon Riggs (#21)
#26Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Stephan Szabo (#22)
#27Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrew Dunstan (#25)
#28Bruce Momjian
bruce@momjian.us
In reply to: Florian Pflug (#23)
#29Andrew Dunstan
andrew@dunslane.net
In reply to: Alvaro Herrera (#27)
#30Phil Currier
pcurrier@gmail.com
In reply to: Bruce Momjian (#28)
#31Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Phil Currier (#30)
#32Stephan Szabo
sszabo@megazone23.bigpanda.com
In reply to: Alvaro Herrera (#26)
#33Andrew Dunstan
andrew@dunslane.net
In reply to: Stephan Szabo (#32)
#34Florian Pflug
fgp@phlo.org
In reply to: Stephan Szabo (#32)
#35Andrew Dunstan
andrew@dunslane.net
In reply to: Florian Pflug (#34)
#36Bruce Momjian
bruce@momjian.us
In reply to: Andrew Dunstan (#33)
#37Andrew Dunstan
andrew@dunslane.net
In reply to: Bruce Momjian (#36)
#38Bruce Momjian
bruce@momjian.us
In reply to: Andrew Dunstan (#37)
#39elein
elein@varlena.com
In reply to: Florian Pflug (#24)
#40Andrew Dunstan
andrew@dunslane.net
In reply to: elein (#39)
#41Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#29)
#42Simon Riggs
simon@2ndQuadrant.com
In reply to: Alvaro Herrera (#27)
#43Phil Currier
pcurrier@gmail.com
In reply to: Tom Lane (#41)
#44Robert Treat
xzilla@users.sourceforge.net
In reply to: Phil Currier (#43)
#45Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Phil Currier (#43)
#46Andrew Dunstan
andrew@dunslane.net
In reply to: Zeugswetter Andreas SB SD (#45)
#47Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Andrew Dunstan (#46)
#48Andrew Dunstan
andrew@dunslane.net
In reply to: Zeugswetter Andreas SB SD (#47)
#49Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Andrew Dunstan (#48)
#50Kris Jurka
books@ejurka.com
In reply to: Zeugswetter Andreas SB SD (#45)
#51Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Kris Jurka (#50)
#52Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Zeugswetter Andreas SB SD (#51)
#53Kris Jurka
books@ejurka.com
In reply to: Alvaro Herrera (#52)
#54Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kris Jurka (#53)
#55Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Alvaro Herrera (#52)
#56Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Kris Jurka (#53)
#57Guillaume Smet
guillaume.smet@gmail.com
In reply to: Phil Currier (#1)
#58Phil Currier
pcurrier@gmail.com
In reply to: Guillaume Smet (#57)
#59Andrew Dunstan
andrew@dunslane.net
In reply to: Phil Currier (#58)
#60Guillaume Smet
guillaume.smet@gmail.com
In reply to: Phil Currier (#58)