Tablespaces

Started by Gavin Sherryabout 22 years ago109 messageshackers
Jump to latest
#1Gavin Sherry
swm@linuxworld.com.au

Hi all,

I've been looking at implementing table spaces for 7.5. Some notes and
implementation details follow.

------

Type of table space:

There are many different table space implementations in relational
database management systems. In my implementation, a table space in
PostgreSQL will be the location of a directory on the file system in
which files backing database objects can be stored. Global tables and
non 'data' objects (WAL, CLOG, config files) will all remain in $PGDATA.
$PGDATA/base will be the default table space.

A given table space will be identified by a unique table space name. I
haven't decided if 'unique' should mean database-wide unique or
cross-database unique. It seems to me that we might run into problems
with CREATE DATABASE ... TEMPLATE = <database with table spaces> if the
uniqueness of table spaces is limited to the database level.

A table space parameter will be added to DDL commands which create
physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to
CREATE SCHEMA. The associated routines, as well as the corresponding DROP
commands will need to be updated. Adding the ability to ALTER <object>
TABLESPACE <name> seems a little painful. Would people use it? Comments?

When an object is created the system will resolve the table space the
object is stored in as follows: if the table space paramater is passed to
the DDL command, then the object is stored in that table space (given
validation of the table space, etc). If it is not passed, the object
inherits its "parent's" table space where the parent/child hierarchy is as
follows: database > schema > table > [index|sequence]. So, if you issued:

create table foo.bar (...);

We would first not that there is no TABLESPACE <name>, then cascade to
the table space for the schema 'foo' (and possibly cascade to the table
space for the database). A database which wasn't created with an explicit
table space will be created under the default table space. This ensures
backward compatibility.

Creating a table space:

A table space is a directory structure. The directory structure is as
follows:

[swm@dev /path/to/tblspc]$ ls
OID1/ OID2/

OID1 and OID2 are the OIDs of databases which have created a table space
against this file system location. In this respect, a table space
resembles $PGDATA/base. I thought it useful to keep this kind of
namespace mechanism in place so that administrators do not need to create
hierarchies of names on different partitions if they want multiple
databases to use the same partition.

The actual creation of the table space will be done with:

CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;

Before creating the table space we must:

1) Check if the directory exists. If it does, create a sub directory as
the OID of the current database.

2) Alternatively, if the directory doesn't exist, attempt to create it,
then the sub directory.

I wonder if a file, such as PG_TBLSPC, should be added to the table space
directory so that, in the case of an existing non-empty directory, we can
attempt to test if the directory is being used for something else and
error out. Seems like:

CREATE TABLESPACE tbl1 LOCATION '/var/'

which will result in something like '/var/123443' is a bad idea. Then
again, the user should know better. Comments?

If everything goes well, we add an entry to pg_tablespace with the table
space location and name (and and OID).

Tying it all together:

The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc'
field. This will be the OID of the table space the object resides in, or 0
(default table space). Since we can then resolve relid/relname, schema and
database to a tablespace, there aren't too many cases when extra logic
needs to be added to the IO framework. In fact, most of it is taken care
of because of the abstraction of relpath().

The creation of table spaces will need to be recorded in xlog in the same
way that files are in heap_create() with the corresponding delete logic
incase of ABORT.

Postmaster startup:

Ideally, the postmaster at startup should go into each tblspc/databaseoid
directory and check for a postmaster.pid file to see if some other
instance is touching the files we're interested in. This will require a
control file listing tblspc/databaseoid paths and it will need to plug
into WAL in case we die during CREATE TABLESPACE. Comments?

Creating a database

I think that createdb() is going to have to be reworked if pg_tablespace
isn't shared (ie, tablespaces are only database unique). The reason being
that if we create a database which has a table space, pg_tablespace in the
new database will have to be updated and that cannot be done atomically
with the `cp` based mechanism we currently use.

I think I'm going to have to get my hands dirty before I can tell the
extent to which createdb() will need reworking.

pg_dump

Obviously pg_dump will need to be able to dump table spaces. pg_dump
running against <7.5 will DDL commands without a table space parameter and
as such the database's physical layout, when loaded into 7.5, will be the
same as for <7.5.

---

Comments? Questions? Suggestions?

Thanks,

Gavin

#2Shridhar Daithankar
shridhar@frodo.hserus.net
In reply to: Gavin Sherry (#1)
Re: Tablespaces

On Thursday 26 February 2004 15:37, Gavin Sherry wrote:

Tying it all together:

The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc'
field. This will be the OID of the table space the object resides in, or 0
(default table space). Since we can then resolve relid/relname, schema and
database to a tablespace, there aren't too many cases when extra logic
needs to be added to the IO framework. In fact, most of it is taken care
of because of the abstraction of relpath().

The creation of table spaces will need to be recorded in xlog in the same
way that files are in heap_create() with the corresponding delete logic
incase of ABORT.

Is tablespace some sort of copyrighted? Last I remembered, the discussion was
about location/storage

Just a thought..

Shridhar

#3Richard Huxton
dev@archonet.com
In reply to: Gavin Sherry (#1)
Re: Tablespaces

On Thursday 26 February 2004 10:07, Gavin Sherry wrote:

CREATE TABLESPACE tbl1 LOCATION '/var/'

which will result in something like '/var/123443' is a bad idea. Then
again, the user should know better. Comments?

The LOCATION should have the same owner and permissions as $PGDATA - that
should catch mistyping.

Unless you're running as root, of course. In which case you clearly know
better than everyone else, so off you go!

Comments? Questions? Suggestions?

Presumably I'm using this to deal with performance/space issues, so there
clearly needs to be something in the pg_stat_xxx system to show figures based
on tablespace - not sure what you'd measure though - disk I/O, number of
nodes?

Perhaps something in contrib/ too - tablespace_disk_usage.pl or some such.
--
Richard Huxton
Archonet Ltd

#4Hans-Jürgen Schönig
postgres@cybertec.at
In reply to: Gavin Sherry (#1)
Re: Tablespaces

Gavin Sherry wrote:

Hi all,

I've been looking at implementing table spaces for 7.5. Some notes and
implementation details follow.

------

Type of table space:

There are many different table space implementations in relational
database management systems. In my implementation, a table space in
PostgreSQL will be the location of a directory on the file system in
which files backing database objects can be stored. Global tables and
non 'data' objects (WAL, CLOG, config files) will all remain in $PGDATA.
$PGDATA/base will be the default table space.

Is it possible to put WALs and CLOGs into different tablespaces? (maybe
different RAID systems). Some companies want that ...

A given table space will be identified by a unique table space name. I
haven't decided if 'unique' should mean database-wide unique or
cross-database unique. It seems to me that we might run into problems
with CREATE DATABASE ... TEMPLATE = <database with table spaces> if the
uniqueness of table spaces is limited to the database level.

I strongly vote for database cluster wide unique names because somebody
could have a tablespace "webusers" or something like that. To me this
makes far more sense.

A table space parameter will be added to DDL commands which create
physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to
CREATE SCHEMA. The associated routines, as well as the corresponding DROP
commands will need to be updated. Adding the ability to ALTER <object>
TABLESPACE <name> seems a little painful. Would people use it? Comments?

I think people won't need it in first place because this seems to be
really painful.
What really matters is that the number of tablespaces and file /
tablespace is unlimited. SAP DB has limited the number of devspaces to
32 (I think). This is real bull.... because if your database grows
unexpectedly you are in deep trouble (expert database design by SAP,
MySQL and 100000....0000 others).

When an object is created the system will resolve the table space the
object is stored in as follows: if the table space paramater is passed to
the DDL command, then the object is stored in that table space (given
validation of the table space, etc). If it is not passed, the object
inherits its "parent's" table space where the parent/child hierarchy is as
follows: database > schema > table > [index|sequence]. So, if you issued:

create table foo.bar (...);

We would first not that there is no TABLESPACE <name>, then cascade to
the table space for the schema 'foo' (and possibly cascade to the table
space for the database). A database which wasn't created with an explicit
table space will be created under the default table space. This ensures
backward compatibility.

Will users automatically be assigned to a certain table space? How is
this going to work?

Creating a table space:

A table space is a directory structure. The directory structure is as
follows:

[swm@dev /path/to/tblspc]$ ls
OID1/ OID2/

OID1 and OID2 are the OIDs of databases which have created a table space
against this file system location. In this respect, a table space
resembles $PGDATA/base. I thought it useful to keep this kind of
namespace mechanism in place so that administrators do not need to create
hierarchies of names on different partitions if they want multiple
databases to use the same partition.

The actual creation of the table space will be done with:

CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;

Before creating the table space we must:

1) Check if the directory exists. If it does, create a sub directory as
the OID of the current database.

2) Alternatively, if the directory doesn't exist, attempt to create it,
then the sub directory.

I wonder if a file, such as PG_TBLSPC, should be added to the table space
directory so that, in the case of an existing non-empty directory, we can
attempt to test if the directory is being used for something else and
error out. Seems like:

CREATE TABLESPACE tbl1 LOCATION '/var/'

which will result in something like '/var/123443' is a bad idea. Then
again, the user should know better. Comments?

If everything goes well, we add an entry to pg_tablespace with the table
space location and name (and and OID).

Tying it all together:

The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc'
field. This will be the OID of the table space the object resides in, or 0
(default table space). Since we can then resolve relid/relname, schema and
database to a tablespace, there aren't too many cases when extra logic
needs to be added to the IO framework. In fact, most of it is taken care
of because of the abstraction of relpath().

The creation of table spaces will need to be recorded in xlog in the same
way that files are in heap_create() with the corresponding delete logic
incase of ABORT.

Postmaster startup:

Ideally, the postmaster at startup should go into each tblspc/databaseoid
directory and check for a postmaster.pid file to see if some other
instance is touching the files we're interested in. This will require a
control file listing tblspc/databaseoid paths and it will need to plug
into WAL in case we die during CREATE TABLESPACE. Comments?

Creating a database

I think that createdb() is going to have to be reworked if pg_tablespace
isn't shared (ie, tablespaces are only database unique). The reason being
that if we create a database which has a table space, pg_tablespace in the
new database will have to be updated and that cannot be done atomically
with the `cp` based mechanism we currently use.

I think I'm going to have to get my hands dirty before I can tell the
extent to which createdb() will need reworking.

pg_dump

Obviously pg_dump will need to be able to dump table spaces. pg_dump
running against <7.5 will DDL commands without a table space parameter and
as such the database's physical layout, when loaded into 7.5, will be the
same as for <7.5.

---

Comments? Questions? Suggestions?

Thanks,

Gavin

Do you plan support for limiting the size of a tablespace? ISPs will
vote for that because they can limit the size of a database on the
database level rather than on the operating system level. Of course
this can and (should???) be done on the operation system level but
people will definitely ask for that.
If sizing is not supported we should definitely provide minor
documentation which tells people how to do that on the operating system
level (at least poting to some useful information).

Best regards,

Hans

--
Cybertec Geschwinde u Schoenig
Schoengrabern 134, A-2020 Hollabrunn, Austria
Tel: +43/2952/30706 or +43/664/233 90 75
www.cybertec.at, www.postgresql.at, kernel.cybertec.at

#5Gavin Sherry
swm@linuxworld.com.au
In reply to: Hans-Jürgen Schönig (#4)
Re: Tablespaces

Is it possible to put WALs and CLOGs into different tablespaces? (maybe
different RAID systems). Some companies want that ...

I wasn't going to look at that just yet.

There is of course the temporary hack of symlinking WAL else where.

I'd be interested to see the performance difference between WAL and data
on the same RAID/controller and WAL and data on different RAID/controller
with Jan's improvements to the buffer management.

Gavin

#6Dennis Bjorklund
db@zigo.dhs.org
In reply to: Gavin Sherry (#1)
Re: Tablespaces

On Thu, 26 Feb 2004, Gavin Sherry wrote:

Comments? Questions? Suggestions?

Is that plan that in the future one can split a single table into
different table spaces? Like storing all rows with year < 1999 in one
tablespace and the rest in another?

With the rule system and two underlying tables one could make it work by
hand I think.

I've never used tablespaces in oracle so I don't know what it can offer. I
though it could do things like the above. True? What is the syntax and
for example, how does it effect indexes (not at all maybe).

If you don't want to discuss this now, I understand. It's not part of the
design as it is now. I'm just curious at what direction we are moving and
what is possible to do.

--
/Dennis Bj�rklund

#7Gavin Sherry
swm@linuxworld.com.au
In reply to: Dennis Bjorklund (#6)
Re: Tablespaces

On Thu, 26 Feb 2004, Dennis Bjorklund wrote:

On Thu, 26 Feb 2004, Gavin Sherry wrote:

Comments? Questions? Suggestions?

Is that plan that in the future one can split a single table into
different table spaces? Like storing all rows with year < 1999 in one
tablespace and the rest in another?

These are called partitions in oracle. You can approximate this with table
spaces by using a partial index and putting it in a different table space.
The problem, of course, is seq scans.

With the rule system and two underlying tables one could make it work by
hand I think.

I've never used tablespaces in oracle so I don't know what it can offer. I

Certainly, table spaces are used in many ways in oracle, db2, etc. You can
mirror data across them, have different buffer sizes for example.
In some implementations, they can be raw disk partitions (no file system).
I don't intend going this far, however.

If you don't want to discuss this now, I understand. It's not part of the
design as it is now. I'm just curious at what direction we are moving and
what is possible to do.

Well, partitions are something else entirely. Mirroring would be
interesting, but RAID designers are better at parallelisation of IO than
(some) database developers. Might be better to keep the problem seperate.

Gavin

#8Joshua D. Drake
jd@commandprompt.com
In reply to: Hans-Jürgen Schönig (#4)
Re: Tablespaces

Is it possible to put WALs and CLOGs into different tablespaces? (maybe
different RAID systems). Some companies want that ...

You can do this now, but it would be nice to be able to have it actually
configurable versus the hacked linked method.

J

--
Co-Founder
Command Prompt, Inc.
The wheel's spinning but the hamster's dead

#9Hans-Jürgen Schönig
postgres@cybertec.at
In reply to: Gavin Sherry (#5)
Re: Tablespaces

Gavin Sherry wrote:

Is it possible to put WALs and CLOGs into different tablespaces? (maybe
different RAID systems). Some companies want that ...

I wasn't going to look at that just yet.

There is of course the temporary hack of symlinking WAL else where.

that's what we do now.
we symlink databases and wals ...

I'd be interested to see the performance difference between WAL and data
on the same RAID/controller and WAL and data on different RAID/controller
with Jan's improvements to the buffer management.

Gavin

yes, that's what i am looking for. i should do some testing.

in case of enough i/o power additional cpus scale almost linearily
(depending on the application of course; i have done some testing on a
customer's aix box ...).
it would be interesting to see what jan's buffer strategy does (and bg
writer) ...

--
Cybertec Geschwinde u Schoenig
Schoengrabern 134, A-2020 Hollabrunn, Austria
Tel: +43/2952/30706 or +43/664/233 90 75
www.cybertec.at, www.postgresql.at, kernel.cybertec.at

#10Alex J. Avriette
alex@posixnap.net
In reply to: Gavin Sherry (#7)
Re: Tablespaces

On Thu, Feb 26, 2004 at 11:22:28PM +1100, Gavin Sherry wrote:

Certainly, table spaces are used in many ways in oracle, db2, etc. You can
mirror data across them, have different buffer sizes for example.
In some implementations, they can be raw disk partitions (no file system).
I don't intend going this far, however.

Perhaps now would be a good time to bring up my directio on Solaris question
from a year or so back? Is there any interest in the ability to use raw
disk?

Alex (who is overjoyed to hear discussion of tablespaces again)

--
alex@posixnap.net
Alex J. Avriette, Unix Systems Gladiator
Sep 25 12:52:39 buggle /bsd: wsdisplay0 at vga1: removing /dev/radio/*

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gavin Sherry (#1)
Re: Tablespaces

Gavin Sherry <swm@linuxworld.com.au> writes:

A table space is a directory structure. The directory structure is as
follows:
[swm@dev /path/to/tblspc]$ ls
OID1/ OID2/
OID1 and OID2 are the OIDs of databases which have created a table space
against this file system location. In this respect, a table space
resembles $PGDATA/base. I thought it useful to keep this kind of
namespace mechanism in place ...

Actually, this is *necessary* AFAICT. The case that forces it is DROP
DATABASE. Since you have to execute that from another database, there's
no reasonable way to look into the target database's catalogs. That
means that the OID of the database has to be sufficient information to
get rid of all its files. You can do this fairly easily if in each
tablespace (whose locations you know from the shared pg_tablespace
table) you can look for a subdirectory matching the target database's
OID. If we tried to put the database's files just "loose" in each
tablespace directory then we'd be in trouble.

I think this is also an implementation reason for favoring cluster-wide
tablespaces over database-local ones. I'm not sure how you drop a
database from outside if you can't see where its tablespaces are.

I believe that it will be necessary to expand RelFileNode to three OIDs
(tablespace, database, relation). I had once hoped that it could be
kept at two (tablespace, relation) but with a physical layout like this
you more or less have to have three.

One issue that needs to be agreed on early is how the low-level file
access code finds a tablespace. What I would personally like is for
$PGDATA to contain symlinks to the tablespace top directories. The
actual access path for any relation could then be built trivially from
its RelFileNode:
$PGDATA/tablespaces/TBOID/DBOID/RELFILENODE
-------------------------
The underlined part references a symlink that leads to the directory
containing the per-database subdirectories.

I am expecting to hear some bleating about this from people whose
preferred platforms don't support symlinks ;-). However, if we don't
rely on symlinks for this then the tablespace-OID-to-physical-path
mapping has to be explicitly known at very low levels of the system
(md.c in particular). We can't expect md.c to get that information by
reading pg_tablespace. It would have to rely on some backdoor path,
such as a flat text file it could read at backend startup. I think
this approach will leave us fighting a lot of problems with locking
and out-of-date information.

Speaking of locking, can we do anything to prevent people from shooting
themselves in the foot by changing active tablespaces? Are we even
going to have a DROP TABLESPACE command, and if so what would it do?

regards, tom lane

#12Barry Lind
barry@xythos.com
In reply to: Gavin Sherry (#1)
Re: Tablespaces

Gavin,

After creating a tablespace what (if any) changes can be done to it.
Can you DROP a tablespace, or once created will it always exist? Can
you RENAME a tablespace? Can you change the location of a tablespace
(i.e you did a disk reorg and move the contents to a different location
and now want to point to the new location)? What are the permissions
necessary to create a tablespace (can any use connected to the database
create a tablespace, or only superuser, or ...)?

Overall this will be a great addition to postgres. I am looking forward
to this feature.

thanks,
--Barry

Gavin Sherry wrote:

Show quoted text

Hi all,

I've been looking at implementing table spaces for 7.5. Some notes and
implementation details follow.

------

Type of table space:

There are many different table space implementations in relational
database management systems. In my implementation, a table space in
PostgreSQL will be the location of a directory on the file system in
which files backing database objects can be stored. Global tables and
non 'data' objects (WAL, CLOG, config files) will all remain in $PGDATA.
$PGDATA/base will be the default table space.

A given table space will be identified by a unique table space name. I
haven't decided if 'unique' should mean database-wide unique or
cross-database unique. It seems to me that we might run into problems
with CREATE DATABASE ... TEMPLATE = <database with table spaces> if the
uniqueness of table spaces is limited to the database level.

A table space parameter will be added to DDL commands which create
physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to
CREATE SCHEMA. The associated routines, as well as the corresponding DROP
commands will need to be updated. Adding the ability to ALTER <object>
TABLESPACE <name> seems a little painful. Would people use it? Comments?

When an object is created the system will resolve the table space the
object is stored in as follows: if the table space paramater is passed to
the DDL command, then the object is stored in that table space (given
validation of the table space, etc). If it is not passed, the object
inherits its "parent's" table space where the parent/child hierarchy is as
follows: database > schema > table > [index|sequence]. So, if you issued:

create table foo.bar (...);

We would first not that there is no TABLESPACE <name>, then cascade to
the table space for the schema 'foo' (and possibly cascade to the table
space for the database). A database which wasn't created with an explicit
table space will be created under the default table space. This ensures
backward compatibility.

Creating a table space:

A table space is a directory structure. The directory structure is as
follows:

[swm@dev /path/to/tblspc]$ ls
OID1/ OID2/

OID1 and OID2 are the OIDs of databases which have created a table space
against this file system location. In this respect, a table space
resembles $PGDATA/base. I thought it useful to keep this kind of
namespace mechanism in place so that administrators do not need to create
hierarchies of names on different partitions if they want multiple
databases to use the same partition.

The actual creation of the table space will be done with:

CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;

Before creating the table space we must:

1) Check if the directory exists. If it does, create a sub directory as
the OID of the current database.

2) Alternatively, if the directory doesn't exist, attempt to create it,
then the sub directory.

I wonder if a file, such as PG_TBLSPC, should be added to the table space
directory so that, in the case of an existing non-empty directory, we can
attempt to test if the directory is being used for something else and
error out. Seems like:

CREATE TABLESPACE tbl1 LOCATION '/var/'

which will result in something like '/var/123443' is a bad idea. Then
again, the user should know better. Comments?

If everything goes well, we add an entry to pg_tablespace with the table
space location and name (and and OID).

Tying it all together:

The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc'
field. This will be the OID of the table space the object resides in, or 0
(default table space). Since we can then resolve relid/relname, schema and
database to a tablespace, there aren't too many cases when extra logic
needs to be added to the IO framework. In fact, most of it is taken care
of because of the abstraction of relpath().

The creation of table spaces will need to be recorded in xlog in the same
way that files are in heap_create() with the corresponding delete logic
incase of ABORT.

Postmaster startup:

Ideally, the postmaster at startup should go into each tblspc/databaseoid
directory and check for a postmaster.pid file to see if some other
instance is touching the files we're interested in. This will require a
control file listing tblspc/databaseoid paths and it will need to plug
into WAL in case we die during CREATE TABLESPACE. Comments?

Creating a database

I think that createdb() is going to have to be reworked if pg_tablespace
isn't shared (ie, tablespaces are only database unique). The reason being
that if we create a database which has a table space, pg_tablespace in the
new database will have to be updated and that cannot be done atomically
with the `cp` based mechanism we currently use.

I think I'm going to have to get my hands dirty before I can tell the
extent to which createdb() will need reworking.

pg_dump

Obviously pg_dump will need to be able to dump table spaces. pg_dump
running against <7.5 will DDL commands without a table space parameter and
as such the database's physical layout, when loaded into 7.5, will be the
same as for <7.5.

---

Comments? Questions? Suggestions?

Thanks,

Gavin

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

#13Gavin Sherry
swm@linuxworld.com.au
In reply to: Alex J. Avriette (#10)
Re: Tablespaces

On Thu, 26 Feb 2004, Alex J. Avriette wrote:

On Thu, Feb 26, 2004 at 11:22:28PM +1100, Gavin Sherry wrote:

Certainly, table spaces are used in many ways in oracle, db2, etc. You can
mirror data across them, have different buffer sizes for example.
In some implementations, they can be raw disk partitions (no file system).
I don't intend going this far, however.

Perhaps now would be a good time to bring up my directio on Solaris question
from a year or so back? Is there any interest in the ability to use raw
disk?

I do not intend to undertake raw disk tablespaces for 7.5. I'd be
interested if anyone could provide some real world benchmarking of file
system vs. raw disk. Postgres benefits a lot from kernel file system cache
at the moment. Also, I believe that database designers have traditionally
made bad file system designers. Raw database partitions often lack the
tools essential to a scalable environment. For example, the ability to
resize partitions.

Gavin

#14Gavin Sherry
swm@linuxworld.com.au
In reply to: Tom Lane (#11)
Re: Tablespaces

On Thu, 26 Feb 2004, Tom Lane wrote:

Gavin Sherry <swm@linuxworld.com.au> writes:

A table space is a directory structure. The directory structure is as
follows:
[swm@dev /path/to/tblspc]$ ls
OID1/ OID2/
OID1 and OID2 are the OIDs of databases which have created a table space
against this file system location. In this respect, a table space
resembles $PGDATA/base. I thought it useful to keep this kind of
namespace mechanism in place ...

Actually, this is *necessary* AFAICT. The case that forces it is DROP
DATABASE. Since you have to execute that from another database, there's
no reasonable way to look into the target database's catalogs. That
means that the OID of the database has to be sufficient information to
get rid of all its files. You can do this fairly easily if in each
tablespace (whose locations you know from the shared pg_tablespace
table) you can look for a subdirectory matching the target database's
OID. If we tried to put the database's files just "loose" in each
tablespace directory then we'd be in trouble.

Ahhh. Yes.

I think this is also an implementation reason for favoring cluster-wide
tablespaces over database-local ones. I'm not sure how you drop a
database from outside if you can't see where its tablespaces are.

Naturally.

I believe that it will be necessary to expand RelFileNode to three OIDs
(tablespace, database, relation). I had once hoped that it could be
kept at two (tablespace, relation) but with a physical layout like this
you more or less have to have three.

Yes. I agree.

One issue that needs to be agreed on early is how the low-level file
access code finds a tablespace. What I would personally like is for
$PGDATA to contain symlinks to the tablespace top directories. The
actual access path for any relation could then be built trivially from
its RelFileNode:
$PGDATA/tablespaces/TBOID/DBOID/RELFILENODE
-------------------------
The underlined part references a symlink that leads to the directory
containing the per-database subdirectories.

I am expecting to hear some bleating about this from people whose
preferred platforms don't support symlinks ;-). However, if we don't

Actually, I think that's a pretty good idea :-). I'd solves a bunch of
issues in the backend (postmaster start up can recurse through
$PGDATA/tablespaces looking for postmaster.pid files) and will also assist
admins with complex configurations (perhaps).

Speaking of locking, can we do anything to prevent people from shooting
themselves in the foot by changing active tablespaces? Are we even
going to have a DROP TABLESPACE command, and if so what would it do?

Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
drop a table space until the directory is empty. We will need a shared
invalidation message so that backends do not attempt to create an object
just after we drop the table space.

Thanks,

Gavin

#15J. Andrew Rogers
jrogers@neopolitan.com
In reply to: Gavin Sherry (#13)
Re: Tablespaces

On Thu, 2004-02-26 at 13:22, Gavin Sherry wrote:

Postgres benefits a lot from kernel file system cache
at the moment.

With the implementation of much smarter and more adaptive cache
replacement algorithm i.e. ARC, I would expect the benefit of using the
kernel file system cache to diminish significantly. It appears to me,
and I could be wrong, that the reason Postgres has depended on the
kernel file system cache isn't that this is obviously better in some
absolute sense (though it might be depending on the deployment
scenario), but that the original cache replacement algorithm in Postgres
was sufficiently poor that the better cache replacement algorithms in
the kernel cache more than offset any sub-optimality that might result
from doing so.

I would expect that with ARC and the redesign of some of the buffer
management bits for more scalability, you might very well get better
performance by allocating most of the memory to the buffer cache rather
than leaving it to the kernel file cache.

I'm actually fairly curious to see what the new buffer management scheme
will mean in terms of real world performance and parameter tuning.

-James Rogers
jrogers@neopolitan.com

#16Simon Riggs
simon@2ndQuadrant.com
In reply to: Gavin Sherry (#1)
Re: Tablespaces

Gavin Sherry
The creation of table spaces will need to be recorded in xlog in the

same

way that files are in heap_create() with the corresponding delete

logic

incase of ABORT.

Overall, sounds very cool.

Please could we record the OID of the tablespace in the WAL logs, not
the path to the tablespace? That way, we run no risks of having the WAL
logs not work correctly should things change slightly...

.. need to record drop tablespaces in the WAL logs also.

I'm sure you meant both of those, just checking.

Can drop tablespace require a specific privelege too? It's too easy to
drop parts of a database without thinking...

Best Regards, Simon Riggs

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gavin Sherry (#14)
Re: Tablespaces

Gavin Sherry <swm@linuxworld.com.au> writes:

Speaking of locking, can we do anything to prevent people from shooting
themselves in the foot by changing active tablespaces? Are we even
going to have a DROP TABLESPACE command, and if so what would it do?

Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
drop a table space until the directory is empty.

How would it get to be empty? Are you thinking of some sort of "connect
database to tablespace" and "disconnect database from tablespace"
commands that would respectively create and delete the per-database
subdirectory? That seems moderately reasonable to me. We could then
invent a locking protocol that requires backends to lock a tablespace
before they can execute either of these operations (or delete the
tablespace of course).

regards, tom lane

#18Alex J. Avriette
alex@posixnap.net
In reply to: Gavin Sherry (#13)
Re: Tablespaces

On Fri, Feb 27, 2004 at 08:22:25AM +1100, Gavin Sherry wrote:

interested if anyone could provide some real world benchmarking of file
system vs. raw disk. Postgres benefits a lot from kernel file system cache
at the moment. Also, I believe that database designers have traditionally
made bad file system designers. Raw database partitions often lack the
tools essential to a scalable environment. For example, the ability to
resize partitions.

The only reason I mentioned it to begin with was the recommendation of
directio for databases in the Sun Blueprint, _Tuning Databases on the
Solaris Platform_ (and being a Solaris geek, I asked, but apparently
nobody else is worried enough about performance or not using Solaris
enough to care).

It's not critical, of course. I think, however, that many of us would
like to see some of the features of Oracle and DB2 available to users
of postgres. Some of these features are raw disk, tablespaces, and
replication. We're getting there, and making really terrific progress
(I fully expect replication to be ready for primetime in the 8-12 mos
timeframe), but we're not quite there yet.

As I said, I'm very glad to hear tablespaces mentioned again and see
what looks like work being done on it.

Thanks!
Alex

--
alex@posixnap.net
Alex J. Avriette, Solaris Artillery Officer
"Among the many misdeeds of the British rule in India, history will look upon the act of depriving a whole nation of arms, as the blackest." - Mahatma Gandhi

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: J. Andrew Rogers (#15)
Re: Tablespaces

James Rogers <jrogers@neopolitan.com> writes:

With the implementation of much smarter and more adaptive cache
replacement algorithm i.e. ARC, I would expect the benefit of using the
kernel file system cache to diminish significantly. It appears to me,
and I could be wrong, that the reason Postgres has depended on the
kernel file system cache isn't that this is obviously better in some
absolute sense (though it might be depending on the deployment
scenario), but that the original cache replacement algorithm in Postgres
was sufficiently poor that the better cache replacement algorithms in
the kernel cache more than offset any sub-optimality that might result
from doing so.

The question of optimality of replacement algorithm is only one of the
arguments for using a small buffer cache. IMHO a considerably stronger
argument is that the kernel's memory management is more flexible: it can
use that memory for either disk cache or program workspace, and it can
change the allocation on-the-fly as load demands. If you dedicate most
of RAM to Postgres buffers then you are likely to be wasting RAM or
swapping heavily. Possibly both :-(

Another gotcha is that unless the OS allows you to lock shared memory
into RAM, the shared buffers themselves could get swapped out, which is
a no-win scenario by any measure. Keeping the shared buffer arena small
helps prevent that by ensuring all the buffers are "hot".

Of course, this is all speculation until we get some real-world
experience with ARC. But I don't expect it to be a magic bullet.

regards, tom lane

#20Bruce Momjian
bruce@momjian.us
In reply to: Gavin Sherry (#14)
Re: Tablespaces

I am expecting to hear some bleating about this from people whose
preferred platforms don't support symlinks ;-). However, if we don't

Well, one option would be to have the low level filesystem storage (md.c?)
routines implement a kind of symlink themselves. Just a file with a special
magic number followed by a path.

I'm normally against reimplementing OS services but symlinks are really a very
simple concept and simple to implement. Especially if you can make a few
simplifying assumptions: they only ever need to appear as the final path
element not as parent directories and tablespaces don't need symlinks pointing
to symlinks. Ideally postgres also doesn't need to implement relative links
either.

--
greg

#21Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Bruce Momjian (#20)
#22Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Zeugswetter Andreas SB SD (#21)
#23scott.marlowe
scott.marlowe@ihs.com
In reply to: Zeugswetter Andreas SB SD (#21)
#24scott.marlowe
scott.marlowe@ihs.com
In reply to: Gavin Sherry (#13)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: scott.marlowe (#23)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: scott.marlowe (#24)
#27scott.marlowe
scott.marlowe@ihs.com
In reply to: Tom Lane (#26)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: scott.marlowe (#27)
#29Thomas Swan
tswan@idigx.com
In reply to: Zeugswetter Andreas SB SD (#22)
#30Gavin Sherry
swm@linuxworld.com.au
In reply to: Thomas Swan (#29)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gavin Sherry (#30)
#32Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#31)
#33Andrew Sullivan
ajs@crankycanuck.ca
In reply to: Alex J. Avriette (#18)
#34Simon Riggs
simon@2ndQuadrant.com
In reply to: Gavin Sherry (#30)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#34)
#36Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#35)
#37Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Gavin Sherry (#1)
#38Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Gavin Sherry (#1)
#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christopher Kings-Lynne (#37)
#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#36)
#41Bruce Momjian
bruce@momjian.us
In reply to: Gavin Sherry (#1)
#42Bruce Momjian
bruce@momjian.us
In reply to: Richard Huxton (#3)
#43Bruce Momjian
bruce@momjian.us
In reply to: Joshua D. Drake (#8)
#44Gavin Sherry
swm@linuxworld.com.au
In reply to: Bruce Momjian (#41)
#45Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#11)
#46Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#17)
#47Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#40)
#48Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#20)
#49Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#26)
#50Bruce Momjian
bruce@momjian.us
In reply to: Josh Berkus (#32)
#51Bruce Momjian
bruce@momjian.us
In reply to: Gavin Sherry (#14)
#52Gavin Sherry
swm@linuxworld.com.au
In reply to: Bruce Momjian (#51)
#53Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joe Conway (#47)
#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gavin Sherry (#52)
#55Bruce Momjian
bruce@momjian.us
In reply to: Gavin Sherry (#52)
#56Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#53)
#57Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#48)
#58Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#57)
#59Gavin Sherry
swm@linuxworld.com.au
In reply to: Tom Lane (#57)
#60Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joe Conway (#56)
#61Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#58)
#62Claudio Natoli
claudio.natoli@memetrics.com
In reply to: Tom Lane (#61)
#63Claudio Natoli
claudio.natoli@memetrics.com
In reply to: Claudio Natoli (#62)
#64Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Claudio Natoli (#63)
#65Oliver Elphick
olly@lfix.co.uk
In reply to: Tom Lane (#54)
#66Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#61)
#67Bruce Momjian
bruce@momjian.us
In reply to: Claudio Natoli (#63)
#68Magnus Hagander
magnus@hagander.net
In reply to: Bruce Momjian (#67)
#69Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB SD (#64)
#70Tom Lane
tgl@sss.pgh.pa.us
In reply to: Oliver Elphick (#65)
#71Andrew Dunstan
andrew@dunslane.net
In reply to: Zeugswetter Andreas SB SD (#64)
#72Magnus Hagander
magnus@hagander.net
In reply to: Andrew Dunstan (#71)
#73Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#48)
#74Barry Lind
barry@xythos.com
In reply to: Oliver Elphick (#65)
#75Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#73)
#76Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#60)
#77Joe Conway
mail@joeconway.com
In reply to: Simon Riggs (#76)
#78Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joe Conway (#77)
#79Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#76)
#80Thomas Swan
tswan@idigx.com
In reply to: Bruce Momjian (#75)
#81Bruce Momjian
bruce@momjian.us
In reply to: Thomas Swan (#80)
#82Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Swan (#80)
#83Thomas Swan
tswan@idigx.com
In reply to: Tom Lane (#82)
#84Shridhar Daithankar
shridhar@frodo.hserus.net
In reply to: Tom Lane (#78)
#85Thomas Swan
tswan@idigx.com
In reply to: Tom Lane (#69)
#86Bruce Momjian
bruce@momjian.us
In reply to: Thomas Swan (#85)
#87Thomas Swan
tswan@idigx.com
In reply to: Bruce Momjian (#86)
In reply to: Thomas Swan (#87)
#89Noname
jearl@bullysports.com
In reply to: Thomas Swan (#87)
#90Thomas Swan
tswan@idigx.com
In reply to: Noname (#89)
#91Richard Huxton
dev@archonet.com
In reply to: Thomas Swan (#90)
#92Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Richard Huxton (#91)
#93Bruce Momjian
bruce@momjian.us
In reply to: Thomas Swan (#90)
#94Bruce Momjian
bruce@momjian.us
In reply to: Zeugswetter Andreas SB SD (#92)
#95scott.marlowe
scott.marlowe@ihs.com
In reply to: Thomas Swan (#90)
In reply to: scott.marlowe (#95)
#97Noname
jearl@bullysports.com
In reply to: Thomas Swan (#90)
#98Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#97)
#99Marko Karppinen
marko@karppinen.fi
In reply to: Bruce Momjian (#75)
#100Andrew Sullivan
ajs@crankycanuck.ca
In reply to: Marko Karppinen (#99)
#101Andreas Pflug
pgadmin@pse-consulting.de
In reply to: Andrew Sullivan (#100)
#102Bruce Momjian
bruce@momjian.us
In reply to: Andrew Sullivan (#100)
#103Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#79)
#104Simon Riggs
simon@2ndQuadrant.com
In reply to: Joe Conway (#77)
#105Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#78)
#106Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#103)
#107Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#104)
#108Bruce Momjian
bruce@momjian.us
In reply to: Dennis Bjorklund (#6)
#109Mark Woodward
pgsql@mohawksoft.com
In reply to: Andrew Sullivan (#33)