Tablespaces

Started by Gavin Sherryalmost 22 years ago109 messages
#1Gavin Sherry
swm@linuxworld.com.au

Hi all,

I've been looking at implementing table spaces for 7.5. Some notes and
implementation details follow.

------

Type of table space:

There are many different table space implementations in relational
database management systems. In my implementation, a table space in
PostgreSQL will be the location of a directory on the file system in
which files backing database objects can be stored. Global tables and
non 'data' objects (WAL, CLOG, config files) will all remain in $PGDATA.
$PGDATA/base will be the default table space.

A given table space will be identified by a unique table space name. I
haven't decided if 'unique' should mean database-wide unique or
cross-database unique. It seems to me that we might run into problems
with CREATE DATABASE ... TEMPLATE = <database with table spaces> if the
uniqueness of table spaces is limited to the database level.

A table space parameter will be added to DDL commands which create
physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to
CREATE SCHEMA. The associated routines, as well as the corresponding DROP
commands will need to be updated. Adding the ability to ALTER <object>
TABLESPACE <name> seems a little painful. Would people use it? Comments?

When an object is created the system will resolve the table space the
object is stored in as follows: if the table space paramater is passed to
the DDL command, then the object is stored in that table space (given
validation of the table space, etc). If it is not passed, the object
inherits its "parent's" table space where the parent/child hierarchy is as
follows: database > schema > table > [index|sequence]. So, if you issued:

create table foo.bar (...);

We would first not that there is no TABLESPACE <name>, then cascade to
the table space for the schema 'foo' (and possibly cascade to the table
space for the database). A database which wasn't created with an explicit
table space will be created under the default table space. This ensures
backward compatibility.

Creating a table space:

A table space is a directory structure. The directory structure is as
follows:

[swm@dev /path/to/tblspc]$ ls
OID1/ OID2/

OID1 and OID2 are the OIDs of databases which have created a table space
against this file system location. In this respect, a table space
resembles $PGDATA/base. I thought it useful to keep this kind of
namespace mechanism in place so that administrators do not need to create
hierarchies of names on different partitions if they want multiple
databases to use the same partition.

The actual creation of the table space will be done with:

CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;

Before creating the table space we must:

1) Check if the directory exists. If it does, create a sub directory as
the OID of the current database.

2) Alternatively, if the directory doesn't exist, attempt to create it,
then the sub directory.

I wonder if a file, such as PG_TBLSPC, should be added to the table space
directory so that, in the case of an existing non-empty directory, we can
attempt to test if the directory is being used for something else and
error out. Seems like:

CREATE TABLESPACE tbl1 LOCATION '/var/'

which will result in something like '/var/123443' is a bad idea. Then
again, the user should know better. Comments?

If everything goes well, we add an entry to pg_tablespace with the table
space location and name (and and OID).

Tying it all together:

The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc'
field. This will be the OID of the table space the object resides in, or 0
(default table space). Since we can then resolve relid/relname, schema and
database to a tablespace, there aren't too many cases when extra logic
needs to be added to the IO framework. In fact, most of it is taken care
of because of the abstraction of relpath().

The creation of table spaces will need to be recorded in xlog in the same
way that files are in heap_create() with the corresponding delete logic
incase of ABORT.

Postmaster startup:

Ideally, the postmaster at startup should go into each tblspc/databaseoid
directory and check for a postmaster.pid file to see if some other
instance is touching the files we're interested in. This will require a
control file listing tblspc/databaseoid paths and it will need to plug
into WAL in case we die during CREATE TABLESPACE. Comments?

Creating a database

I think that createdb() is going to have to be reworked if pg_tablespace
isn't shared (ie, tablespaces are only database unique). The reason being
that if we create a database which has a table space, pg_tablespace in the
new database will have to be updated and that cannot be done atomically
with the `cp` based mechanism we currently use.

I think I'm going to have to get my hands dirty before I can tell the
extent to which createdb() will need reworking.

pg_dump

Obviously pg_dump will need to be able to dump table spaces. pg_dump
running against <7.5 will DDL commands without a table space parameter and
as such the database's physical layout, when loaded into 7.5, will be the
same as for <7.5.

---

Comments? Questions? Suggestions?

Thanks,

Gavin

#2Shridhar Daithankar
shridhar@frodo.hserus.net
In reply to: Gavin Sherry (#1)
Re: Tablespaces

On Thursday 26 February 2004 15:37, Gavin Sherry wrote:

Tying it all together:

The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc'
field. This will be the OID of the table space the object resides in, or 0
(default table space). Since we can then resolve relid/relname, schema and
database to a tablespace, there aren't too many cases when extra logic
needs to be added to the IO framework. In fact, most of it is taken care
of because of the abstraction of relpath().

The creation of table spaces will need to be recorded in xlog in the same
way that files are in heap_create() with the corresponding delete logic
incase of ABORT.

Is tablespace some sort of copyrighted? Last I remembered, the discussion was
about location/storage

Just a thought..

Shridhar

#3Richard Huxton
dev@archonet.com
In reply to: Gavin Sherry (#1)
Re: Tablespaces

On Thursday 26 February 2004 10:07, Gavin Sherry wrote:

CREATE TABLESPACE tbl1 LOCATION '/var/'

which will result in something like '/var/123443' is a bad idea. Then
again, the user should know better. Comments?

The LOCATION should have the same owner and permissions as $PGDATA - that
should catch mistyping.

Unless you're running as root, of course. In which case you clearly know
better than everyone else, so off you go!

Comments? Questions? Suggestions?

Presumably I'm using this to deal with performance/space issues, so there
clearly needs to be something in the pg_stat_xxx system to show figures based
on tablespace - not sure what you'd measure though - disk I/O, number of
nodes?

Perhaps something in contrib/ too - tablespace_disk_usage.pl or some such.
--
Richard Huxton
Archonet Ltd

#4Hans-Jürgen Schönig
postgres@cybertec.at
In reply to: Gavin Sherry (#1)
Re: Tablespaces

Gavin Sherry wrote:

Hi all,

I've been looking at implementing table spaces for 7.5. Some notes and
implementation details follow.

------

Type of table space:

There are many different table space implementations in relational
database management systems. In my implementation, a table space in
PostgreSQL will be the location of a directory on the file system in
which files backing database objects can be stored. Global tables and
non 'data' objects (WAL, CLOG, config files) will all remain in $PGDATA.
$PGDATA/base will be the default table space.

Is it possible to put WALs and CLOGs into different tablespaces? (maybe
different RAID systems). Some companies want that ...

A given table space will be identified by a unique table space name. I
haven't decided if 'unique' should mean database-wide unique or
cross-database unique. It seems to me that we might run into problems
with CREATE DATABASE ... TEMPLATE = <database with table spaces> if the
uniqueness of table spaces is limited to the database level.

I strongly vote for database cluster wide unique names because somebody
could have a tablespace "webusers" or something like that. To me this
makes far more sense.

A table space parameter will be added to DDL commands which create
physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to
CREATE SCHEMA. The associated routines, as well as the corresponding DROP
commands will need to be updated. Adding the ability to ALTER <object>
TABLESPACE <name> seems a little painful. Would people use it? Comments?

I think people won't need it in first place because this seems to be
really painful.
What really matters is that the number of tablespaces and file /
tablespace is unlimited. SAP DB has limited the number of devspaces to
32 (I think). This is real bull.... because if your database grows
unexpectedly you are in deep trouble (expert database design by SAP,
MySQL and 100000....0000 others).

When an object is created the system will resolve the table space the
object is stored in as follows: if the table space paramater is passed to
the DDL command, then the object is stored in that table space (given
validation of the table space, etc). If it is not passed, the object
inherits its "parent's" table space where the parent/child hierarchy is as
follows: database > schema > table > [index|sequence]. So, if you issued:

create table foo.bar (...);

We would first not that there is no TABLESPACE <name>, then cascade to
the table space for the schema 'foo' (and possibly cascade to the table
space for the database). A database which wasn't created with an explicit
table space will be created under the default table space. This ensures
backward compatibility.

Will users automatically be assigned to a certain table space? How is
this going to work?

Creating a table space:

A table space is a directory structure. The directory structure is as
follows:

[swm@dev /path/to/tblspc]$ ls
OID1/ OID2/

OID1 and OID2 are the OIDs of databases which have created a table space
against this file system location. In this respect, a table space
resembles $PGDATA/base. I thought it useful to keep this kind of
namespace mechanism in place so that administrators do not need to create
hierarchies of names on different partitions if they want multiple
databases to use the same partition.

The actual creation of the table space will be done with:

CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;

Before creating the table space we must:

1) Check if the directory exists. If it does, create a sub directory as
the OID of the current database.

2) Alternatively, if the directory doesn't exist, attempt to create it,
then the sub directory.

I wonder if a file, such as PG_TBLSPC, should be added to the table space
directory so that, in the case of an existing non-empty directory, we can
attempt to test if the directory is being used for something else and
error out. Seems like:

CREATE TABLESPACE tbl1 LOCATION '/var/'

which will result in something like '/var/123443' is a bad idea. Then
again, the user should know better. Comments?

If everything goes well, we add an entry to pg_tablespace with the table
space location and name (and and OID).

Tying it all together:

The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc'
field. This will be the OID of the table space the object resides in, or 0
(default table space). Since we can then resolve relid/relname, schema and
database to a tablespace, there aren't too many cases when extra logic
needs to be added to the IO framework. In fact, most of it is taken care
of because of the abstraction of relpath().

The creation of table spaces will need to be recorded in xlog in the same
way that files are in heap_create() with the corresponding delete logic
incase of ABORT.

Postmaster startup:

Ideally, the postmaster at startup should go into each tblspc/databaseoid
directory and check for a postmaster.pid file to see if some other
instance is touching the files we're interested in. This will require a
control file listing tblspc/databaseoid paths and it will need to plug
into WAL in case we die during CREATE TABLESPACE. Comments?

Creating a database

I think that createdb() is going to have to be reworked if pg_tablespace
isn't shared (ie, tablespaces are only database unique). The reason being
that if we create a database which has a table space, pg_tablespace in the
new database will have to be updated and that cannot be done atomically
with the `cp` based mechanism we currently use.

I think I'm going to have to get my hands dirty before I can tell the
extent to which createdb() will need reworking.

pg_dump

Obviously pg_dump will need to be able to dump table spaces. pg_dump
running against <7.5 will DDL commands without a table space parameter and
as such the database's physical layout, when loaded into 7.5, will be the
same as for <7.5.

---

Comments? Questions? Suggestions?

Thanks,

Gavin

Do you plan support for limiting the size of a tablespace? ISPs will
vote for that because they can limit the size of a database on the
database level rather than on the operating system level. Of course
this can and (should???) be done on the operation system level but
people will definitely ask for that.
If sizing is not supported we should definitely provide minor
documentation which tells people how to do that on the operating system
level (at least poting to some useful information).

Best regards,

Hans

--
Cybertec Geschwinde u Schoenig
Schoengrabern 134, A-2020 Hollabrunn, Austria
Tel: +43/2952/30706 or +43/664/233 90 75
www.cybertec.at, www.postgresql.at, kernel.cybertec.at

#5Gavin Sherry
swm@linuxworld.com.au
In reply to: Hans-Jürgen Schönig (#4)
Re: Tablespaces

Is it possible to put WALs and CLOGs into different tablespaces? (maybe
different RAID systems). Some companies want that ...

I wasn't going to look at that just yet.

There is of course the temporary hack of symlinking WAL else where.

I'd be interested to see the performance difference between WAL and data
on the same RAID/controller and WAL and data on different RAID/controller
with Jan's improvements to the buffer management.

Gavin

#6Dennis Bjorklund
db@zigo.dhs.org
In reply to: Gavin Sherry (#1)
Re: Tablespaces

On Thu, 26 Feb 2004, Gavin Sherry wrote:

Comments? Questions? Suggestions?

Is that plan that in the future one can split a single table into
different table spaces? Like storing all rows with year < 1999 in one
tablespace and the rest in another?

With the rule system and two underlying tables one could make it work by
hand I think.

I've never used tablespaces in oracle so I don't know what it can offer. I
though it could do things like the above. True? What is the syntax and
for example, how does it effect indexes (not at all maybe).

If you don't want to discuss this now, I understand. It's not part of the
design as it is now. I'm just curious at what direction we are moving and
what is possible to do.

--
/Dennis Bj�rklund

#7Gavin Sherry
swm@linuxworld.com.au
In reply to: Dennis Bjorklund (#6)
Re: Tablespaces

On Thu, 26 Feb 2004, Dennis Bjorklund wrote:

On Thu, 26 Feb 2004, Gavin Sherry wrote:

Comments? Questions? Suggestions?

Is that plan that in the future one can split a single table into
different table spaces? Like storing all rows with year < 1999 in one
tablespace and the rest in another?

These are called partitions in oracle. You can approximate this with table
spaces by using a partial index and putting it in a different table space.
The problem, of course, is seq scans.

With the rule system and two underlying tables one could make it work by
hand I think.

I've never used tablespaces in oracle so I don't know what it can offer. I

Certainly, table spaces are used in many ways in oracle, db2, etc. You can
mirror data across them, have different buffer sizes for example.
In some implementations, they can be raw disk partitions (no file system).
I don't intend going this far, however.

If you don't want to discuss this now, I understand. It's not part of the
design as it is now. I'm just curious at what direction we are moving and
what is possible to do.

Well, partitions are something else entirely. Mirroring would be
interesting, but RAID designers are better at parallelisation of IO than
(some) database developers. Might be better to keep the problem seperate.

Gavin

#8Joshua D. Drake
jd@commandprompt.com
In reply to: Hans-Jürgen Schönig (#4)
Re: Tablespaces

Is it possible to put WALs and CLOGs into different tablespaces? (maybe
different RAID systems). Some companies want that ...

You can do this now, but it would be nice to be able to have it actually
configurable versus the hacked linked method.

J

--
Co-Founder
Command Prompt, Inc.
The wheel's spinning but the hamster's dead

#9Hans-Jürgen Schönig
postgres@cybertec.at
In reply to: Gavin Sherry (#5)
Re: Tablespaces

Gavin Sherry wrote:

Is it possible to put WALs and CLOGs into different tablespaces? (maybe
different RAID systems). Some companies want that ...

I wasn't going to look at that just yet.

There is of course the temporary hack of symlinking WAL else where.

that's what we do now.
we symlink databases and wals ...

I'd be interested to see the performance difference between WAL and data
on the same RAID/controller and WAL and data on different RAID/controller
with Jan's improvements to the buffer management.

Gavin

yes, that's what i am looking for. i should do some testing.

in case of enough i/o power additional cpus scale almost linearily
(depending on the application of course; i have done some testing on a
customer's aix box ...).
it would be interesting to see what jan's buffer strategy does (and bg
writer) ...

--
Cybertec Geschwinde u Schoenig
Schoengrabern 134, A-2020 Hollabrunn, Austria
Tel: +43/2952/30706 or +43/664/233 90 75
www.cybertec.at, www.postgresql.at, kernel.cybertec.at

#10Alex J. Avriette
alex@posixnap.net
In reply to: Gavin Sherry (#7)
Re: Tablespaces

On Thu, Feb 26, 2004 at 11:22:28PM +1100, Gavin Sherry wrote:

Certainly, table spaces are used in many ways in oracle, db2, etc. You can
mirror data across them, have different buffer sizes for example.
In some implementations, they can be raw disk partitions (no file system).
I don't intend going this far, however.

Perhaps now would be a good time to bring up my directio on Solaris question
from a year or so back? Is there any interest in the ability to use raw
disk?

Alex (who is overjoyed to hear discussion of tablespaces again)

--
alex@posixnap.net
Alex J. Avriette, Unix Systems Gladiator
Sep 25 12:52:39 buggle /bsd: wsdisplay0 at vga1: removing /dev/radio/*

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gavin Sherry (#1)
Re: Tablespaces

Gavin Sherry <swm@linuxworld.com.au> writes:

A table space is a directory structure. The directory structure is as
follows:
[swm@dev /path/to/tblspc]$ ls
OID1/ OID2/
OID1 and OID2 are the OIDs of databases which have created a table space
against this file system location. In this respect, a table space
resembles $PGDATA/base. I thought it useful to keep this kind of
namespace mechanism in place ...

Actually, this is *necessary* AFAICT. The case that forces it is DROP
DATABASE. Since you have to execute that from another database, there's
no reasonable way to look into the target database's catalogs. That
means that the OID of the database has to be sufficient information to
get rid of all its files. You can do this fairly easily if in each
tablespace (whose locations you know from the shared pg_tablespace
table) you can look for a subdirectory matching the target database's
OID. If we tried to put the database's files just "loose" in each
tablespace directory then we'd be in trouble.

I think this is also an implementation reason for favoring cluster-wide
tablespaces over database-local ones. I'm not sure how you drop a
database from outside if you can't see where its tablespaces are.

I believe that it will be necessary to expand RelFileNode to three OIDs
(tablespace, database, relation). I had once hoped that it could be
kept at two (tablespace, relation) but with a physical layout like this
you more or less have to have three.

One issue that needs to be agreed on early is how the low-level file
access code finds a tablespace. What I would personally like is for
$PGDATA to contain symlinks to the tablespace top directories. The
actual access path for any relation could then be built trivially from
its RelFileNode:
$PGDATA/tablespaces/TBOID/DBOID/RELFILENODE
-------------------------
The underlined part references a symlink that leads to the directory
containing the per-database subdirectories.

I am expecting to hear some bleating about this from people whose
preferred platforms don't support symlinks ;-). However, if we don't
rely on symlinks for this then the tablespace-OID-to-physical-path
mapping has to be explicitly known at very low levels of the system
(md.c in particular). We can't expect md.c to get that information by
reading pg_tablespace. It would have to rely on some backdoor path,
such as a flat text file it could read at backend startup. I think
this approach will leave us fighting a lot of problems with locking
and out-of-date information.

Speaking of locking, can we do anything to prevent people from shooting
themselves in the foot by changing active tablespaces? Are we even
going to have a DROP TABLESPACE command, and if so what would it do?

regards, tom lane

#12Barry Lind
blind@xythos.com
In reply to: Gavin Sherry (#1)
Re: Tablespaces

Gavin,

After creating a tablespace what (if any) changes can be done to it.
Can you DROP a tablespace, or once created will it always exist? Can
you RENAME a tablespace? Can you change the location of a tablespace
(i.e you did a disk reorg and move the contents to a different location
and now want to point to the new location)? What are the permissions
necessary to create a tablespace (can any use connected to the database
create a tablespace, or only superuser, or ...)?

Overall this will be a great addition to postgres. I am looking forward
to this feature.

thanks,
--Barry

Gavin Sherry wrote:

Show quoted text

Hi all,

I've been looking at implementing table spaces for 7.5. Some notes and
implementation details follow.

------

Type of table space:

There are many different table space implementations in relational
database management systems. In my implementation, a table space in
PostgreSQL will be the location of a directory on the file system in
which files backing database objects can be stored. Global tables and
non 'data' objects (WAL, CLOG, config files) will all remain in $PGDATA.
$PGDATA/base will be the default table space.

A given table space will be identified by a unique table space name. I
haven't decided if 'unique' should mean database-wide unique or
cross-database unique. It seems to me that we might run into problems
with CREATE DATABASE ... TEMPLATE = <database with table spaces> if the
uniqueness of table spaces is limited to the database level.

A table space parameter will be added to DDL commands which create
physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to
CREATE SCHEMA. The associated routines, as well as the corresponding DROP
commands will need to be updated. Adding the ability to ALTER <object>
TABLESPACE <name> seems a little painful. Would people use it? Comments?

When an object is created the system will resolve the table space the
object is stored in as follows: if the table space paramater is passed to
the DDL command, then the object is stored in that table space (given
validation of the table space, etc). If it is not passed, the object
inherits its "parent's" table space where the parent/child hierarchy is as
follows: database > schema > table > [index|sequence]. So, if you issued:

create table foo.bar (...);

We would first not that there is no TABLESPACE <name>, then cascade to
the table space for the schema 'foo' (and possibly cascade to the table
space for the database). A database which wasn't created with an explicit
table space will be created under the default table space. This ensures
backward compatibility.

Creating a table space:

A table space is a directory structure. The directory structure is as
follows:

[swm@dev /path/to/tblspc]$ ls
OID1/ OID2/

OID1 and OID2 are the OIDs of databases which have created a table space
against this file system location. In this respect, a table space
resembles $PGDATA/base. I thought it useful to keep this kind of
namespace mechanism in place so that administrators do not need to create
hierarchies of names on different partitions if they want multiple
databases to use the same partition.

The actual creation of the table space will be done with:

CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;

Before creating the table space we must:

1) Check if the directory exists. If it does, create a sub directory as
the OID of the current database.

2) Alternatively, if the directory doesn't exist, attempt to create it,
then the sub directory.

I wonder if a file, such as PG_TBLSPC, should be added to the table space
directory so that, in the case of an existing non-empty directory, we can
attempt to test if the directory is being used for something else and
error out. Seems like:

CREATE TABLESPACE tbl1 LOCATION '/var/'

which will result in something like '/var/123443' is a bad idea. Then
again, the user should know better. Comments?

If everything goes well, we add an entry to pg_tablespace with the table
space location and name (and and OID).

Tying it all together:

The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc'
field. This will be the OID of the table space the object resides in, or 0
(default table space). Since we can then resolve relid/relname, schema and
database to a tablespace, there aren't too many cases when extra logic
needs to be added to the IO framework. In fact, most of it is taken care
of because of the abstraction of relpath().

The creation of table spaces will need to be recorded in xlog in the same
way that files are in heap_create() with the corresponding delete logic
incase of ABORT.

Postmaster startup:

Ideally, the postmaster at startup should go into each tblspc/databaseoid
directory and check for a postmaster.pid file to see if some other
instance is touching the files we're interested in. This will require a
control file listing tblspc/databaseoid paths and it will need to plug
into WAL in case we die during CREATE TABLESPACE. Comments?

Creating a database

I think that createdb() is going to have to be reworked if pg_tablespace
isn't shared (ie, tablespaces are only database unique). The reason being
that if we create a database which has a table space, pg_tablespace in the
new database will have to be updated and that cannot be done atomically
with the `cp` based mechanism we currently use.

I think I'm going to have to get my hands dirty before I can tell the
extent to which createdb() will need reworking.

pg_dump

Obviously pg_dump will need to be able to dump table spaces. pg_dump
running against <7.5 will DDL commands without a table space parameter and
as such the database's physical layout, when loaded into 7.5, will be the
same as for <7.5.

---

Comments? Questions? Suggestions?

Thanks,

Gavin

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

#13Gavin Sherry
swm@linuxworld.com.au
In reply to: Alex J. Avriette (#10)
Re: Tablespaces

On Thu, 26 Feb 2004, Alex J. Avriette wrote:

On Thu, Feb 26, 2004 at 11:22:28PM +1100, Gavin Sherry wrote:

Certainly, table spaces are used in many ways in oracle, db2, etc. You can
mirror data across them, have different buffer sizes for example.
In some implementations, they can be raw disk partitions (no file system).
I don't intend going this far, however.

Perhaps now would be a good time to bring up my directio on Solaris question
from a year or so back? Is there any interest in the ability to use raw
disk?

I do not intend to undertake raw disk tablespaces for 7.5. I'd be
interested if anyone could provide some real world benchmarking of file
system vs. raw disk. Postgres benefits a lot from kernel file system cache
at the moment. Also, I believe that database designers have traditionally
made bad file system designers. Raw database partitions often lack the
tools essential to a scalable environment. For example, the ability to
resize partitions.

Gavin

#14Gavin Sherry
swm@linuxworld.com.au
In reply to: Tom Lane (#11)
Re: Tablespaces

On Thu, 26 Feb 2004, Tom Lane wrote:

Gavin Sherry <swm@linuxworld.com.au> writes:

A table space is a directory structure. The directory structure is as
follows:
[swm@dev /path/to/tblspc]$ ls
OID1/ OID2/
OID1 and OID2 are the OIDs of databases which have created a table space
against this file system location. In this respect, a table space
resembles $PGDATA/base. I thought it useful to keep this kind of
namespace mechanism in place ...

Actually, this is *necessary* AFAICT. The case that forces it is DROP
DATABASE. Since you have to execute that from another database, there's
no reasonable way to look into the target database's catalogs. That
means that the OID of the database has to be sufficient information to
get rid of all its files. You can do this fairly easily if in each
tablespace (whose locations you know from the shared pg_tablespace
table) you can look for a subdirectory matching the target database's
OID. If we tried to put the database's files just "loose" in each
tablespace directory then we'd be in trouble.

Ahhh. Yes.

I think this is also an implementation reason for favoring cluster-wide
tablespaces over database-local ones. I'm not sure how you drop a
database from outside if you can't see where its tablespaces are.

Naturally.

I believe that it will be necessary to expand RelFileNode to three OIDs
(tablespace, database, relation). I had once hoped that it could be
kept at two (tablespace, relation) but with a physical layout like this
you more or less have to have three.

Yes. I agree.

One issue that needs to be agreed on early is how the low-level file
access code finds a tablespace. What I would personally like is for
$PGDATA to contain symlinks to the tablespace top directories. The
actual access path for any relation could then be built trivially from
its RelFileNode:
$PGDATA/tablespaces/TBOID/DBOID/RELFILENODE
-------------------------
The underlined part references a symlink that leads to the directory
containing the per-database subdirectories.

I am expecting to hear some bleating about this from people whose
preferred platforms don't support symlinks ;-). However, if we don't

Actually, I think that's a pretty good idea :-). I'd solves a bunch of
issues in the backend (postmaster start up can recurse through
$PGDATA/tablespaces looking for postmaster.pid files) and will also assist
admins with complex configurations (perhaps).

Speaking of locking, can we do anything to prevent people from shooting
themselves in the foot by changing active tablespaces? Are we even
going to have a DROP TABLESPACE command, and if so what would it do?

Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
drop a table space until the directory is empty. We will need a shared
invalidation message so that backends do not attempt to create an object
just after we drop the table space.

Thanks,

Gavin

#15James Rogers
jrogers@neopolitan.com
In reply to: Gavin Sherry (#13)
Re: Tablespaces

On Thu, 2004-02-26 at 13:22, Gavin Sherry wrote:

Postgres benefits a lot from kernel file system cache
at the moment.

With the implementation of much smarter and more adaptive cache
replacement algorithm i.e. ARC, I would expect the benefit of using the
kernel file system cache to diminish significantly. It appears to me,
and I could be wrong, that the reason Postgres has depended on the
kernel file system cache isn't that this is obviously better in some
absolute sense (though it might be depending on the deployment
scenario), but that the original cache replacement algorithm in Postgres
was sufficiently poor that the better cache replacement algorithms in
the kernel cache more than offset any sub-optimality that might result
from doing so.

I would expect that with ARC and the redesign of some of the buffer
management bits for more scalability, you might very well get better
performance by allocating most of the memory to the buffer cache rather
than leaving it to the kernel file cache.

I'm actually fairly curious to see what the new buffer management scheme
will mean in terms of real world performance and parameter tuning.

-James Rogers
jrogers@neopolitan.com

#16Simon Riggs
simon@2ndquadrant.com
In reply to: Gavin Sherry (#1)
Re: Tablespaces

Gavin Sherry
The creation of table spaces will need to be recorded in xlog in the

same

way that files are in heap_create() with the corresponding delete

logic

incase of ABORT.

Overall, sounds very cool.

Please could we record the OID of the tablespace in the WAL logs, not
the path to the tablespace? That way, we run no risks of having the WAL
logs not work correctly should things change slightly...

.. need to record drop tablespaces in the WAL logs also.

I'm sure you meant both of those, just checking.

Can drop tablespace require a specific privelege too? It's too easy to
drop parts of a database without thinking...

Best Regards, Simon Riggs

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gavin Sherry (#14)
Re: Tablespaces

Gavin Sherry <swm@linuxworld.com.au> writes:

Speaking of locking, can we do anything to prevent people from shooting
themselves in the foot by changing active tablespaces? Are we even
going to have a DROP TABLESPACE command, and if so what would it do?

Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
drop a table space until the directory is empty.

How would it get to be empty? Are you thinking of some sort of "connect
database to tablespace" and "disconnect database from tablespace"
commands that would respectively create and delete the per-database
subdirectory? That seems moderately reasonable to me. We could then
invent a locking protocol that requires backends to lock a tablespace
before they can execute either of these operations (or delete the
tablespace of course).

regards, tom lane

#18Alex J. Avriette
alex@posixnap.net
In reply to: Gavin Sherry (#13)
Re: Tablespaces

On Fri, Feb 27, 2004 at 08:22:25AM +1100, Gavin Sherry wrote:

interested if anyone could provide some real world benchmarking of file
system vs. raw disk. Postgres benefits a lot from kernel file system cache
at the moment. Also, I believe that database designers have traditionally
made bad file system designers. Raw database partitions often lack the
tools essential to a scalable environment. For example, the ability to
resize partitions.

The only reason I mentioned it to begin with was the recommendation of
directio for databases in the Sun Blueprint, _Tuning Databases on the
Solaris Platform_ (and being a Solaris geek, I asked, but apparently
nobody else is worried enough about performance or not using Solaris
enough to care).

It's not critical, of course. I think, however, that many of us would
like to see some of the features of Oracle and DB2 available to users
of postgres. Some of these features are raw disk, tablespaces, and
replication. We're getting there, and making really terrific progress
(I fully expect replication to be ready for primetime in the 8-12 mos
timeframe), but we're not quite there yet.

As I said, I'm very glad to hear tablespaces mentioned again and see
what looks like work being done on it.

Thanks!
Alex

--
alex@posixnap.net
Alex J. Avriette, Solaris Artillery Officer
"Among the many misdeeds of the British rule in India, history will look upon the act of depriving a whole nation of arms, as the blackest." - Mahatma Gandhi

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: James Rogers (#15)
Re: Tablespaces

James Rogers <jrogers@neopolitan.com> writes:

With the implementation of much smarter and more adaptive cache
replacement algorithm i.e. ARC, I would expect the benefit of using the
kernel file system cache to diminish significantly. It appears to me,
and I could be wrong, that the reason Postgres has depended on the
kernel file system cache isn't that this is obviously better in some
absolute sense (though it might be depending on the deployment
scenario), but that the original cache replacement algorithm in Postgres
was sufficiently poor that the better cache replacement algorithms in
the kernel cache more than offset any sub-optimality that might result
from doing so.

The question of optimality of replacement algorithm is only one of the
arguments for using a small buffer cache. IMHO a considerably stronger
argument is that the kernel's memory management is more flexible: it can
use that memory for either disk cache or program workspace, and it can
change the allocation on-the-fly as load demands. If you dedicate most
of RAM to Postgres buffers then you are likely to be wasting RAM or
swapping heavily. Possibly both :-(

Another gotcha is that unless the OS allows you to lock shared memory
into RAM, the shared buffers themselves could get swapped out, which is
a no-win scenario by any measure. Keeping the shared buffer arena small
helps prevent that by ensuring all the buffers are "hot".

Of course, this is all speculation until we get some real-world
experience with ARC. But I don't expect it to be a magic bullet.

regards, tom lane

#20Greg Stark
gsstark@mit.edu
In reply to: Gavin Sherry (#14)
Re: Tablespaces

I am expecting to hear some bleating about this from people whose
preferred platforms don't support symlinks ;-). However, if we don't

Well, one option would be to have the low level filesystem storage (md.c?)
routines implement a kind of symlink themselves. Just a file with a special
magic number followed by a path.

I'm normally against reimplementing OS services but symlinks are really a very
simple concept and simple to implement. Especially if you can make a few
simplifying assumptions: they only ever need to appear as the final path
element not as parent directories and tablespaces don't need symlinks pointing
to symlinks. Ideally postgres also doesn't need to implement relative links
either.

--
greg

#21Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Greg Stark (#20)
Re: Tablespaces

Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
drop a table space until the directory is empty.

Agreed.

How would it get to be empty? Are you thinking of some sort of "connect
database to tablespace" and "disconnect database from tablespace"
commands that would respectively create and delete the per-database
subdirectory? That seems moderately reasonable to me. We could then

I would only allow the drop if the directory only contains empty db oid
directories.

Andreas

#22Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Zeugswetter Andreas SB SD (#21)
Re: Tablespaces

I do not intend to undertake raw disk tablespaces for 7.5. I'd be
interested if anyone could provide some real world benchmarking of file
system vs. raw disk. Postgres benefits a lot from kernel file system cache
at the moment.

Yes, and don't forget that pg also relys on the OS for grouping and
sorting the physical writes and doing readahead where appropriate.

The use of raw disks is usually paired with the use of kernel aio.
The difference is said to be up to 30% on Solaris. I can assert, that
it made the difference between a bogged down system and a much better behaved
DB on Sun here.

My experience with kaio on AIX Informix is, that kaio is faster as long as IO
is not the bottleneck (disk 100% busy is the metric to watch, not Mb/s), while
for an IO bound system the Informix builtin IO threads that can be used instead
win. (Since they obviously do better at grouping, sorting and readahead
than the AIX kernel does for kaio)

Overall I think the price and komplexity is too high, especially since there are
enough platforms where the kernel does a pretty good job at grouping, sorting and
readahead. Additionally the kernel takes non PostgreSQL IO into account.

Andreas

#23scott.marlowe
scott.marlowe@ihs.com
In reply to: Zeugswetter Andreas SB SD (#21)
Re: Tablespaces

On Fri, 27 Feb 2004, Zeugswetter Andreas SB SD wrote:

Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
drop a table space until the directory is empty.

Agreed.

How would it get to be empty? Are you thinking of some sort of "connect
database to tablespace" and "disconnect database from tablespace"
commands that would respectively create and delete the per-database
subdirectory? That seems moderately reasonable to me. We could then

I would only allow the drop if the directory only contains empty db oid
directories.

Wouldn't this be better tracked in the dependency tracking that's already
built into postgresql? Checking to see if the directory is empty is open
to race conditions, but locking the dependency tracking while dropping a
tablespace isn't.

#24scott.marlowe
scott.marlowe@ihs.com
In reply to: Gavin Sherry (#13)
Re: Tablespaces

On Fri, 27 Feb 2004, Gavin Sherry wrote:

On Thu, 26 Feb 2004, Alex J. Avriette wrote:

On Thu, Feb 26, 2004 at 11:22:28PM +1100, Gavin Sherry wrote:

Certainly, table spaces are used in many ways in oracle, db2, etc. You can
mirror data across them, have different buffer sizes for example.
In some implementations, they can be raw disk partitions (no file system).
I don't intend going this far, however.

Perhaps now would be a good time to bring up my directio on Solaris question
from a year or so back? Is there any interest in the ability to use raw
disk?

I do not intend to undertake raw disk tablespaces for 7.5. I'd be
interested if anyone could provide some real world benchmarking of file
system vs. raw disk. Postgres benefits a lot from kernel file system cache
at the moment. Also, I believe that database designers have traditionally
made bad file system designers. Raw database partitions often lack the
tools essential to a scalable environment. For example, the ability to
resize partitions.

Is possible / reasonable / smart and or dumb to look at implementing the
tablespaces as riding atop the initlocation handled stuff. I.e.
postgresql can only create tablespaces in areas that are created by
initlocation, thus keeping it in its box, so to speak?

#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: scott.marlowe (#23)
Re: Tablespaces

"scott.marlowe" <scott.marlowe@ihs.com> writes:

On Fri, 27 Feb 2004, Zeugswetter Andreas SB SD wrote:

How would it get to be empty? Are you thinking of some sort of "connect
database to tablespace" and "disconnect database from tablespace"
commands that would respectively create and delete the per-database
subdirectory? That seems moderately reasonable to me. We could then

I would only allow the drop if the directory only contains empty db oid
directories.

That's subject to race conditions (ie, someone creating a table about
the same time you are deciding it's okay to drop the tablespace). There
needs to be some interlock, and I think that associating that lock with
infrequently executed connect/disconnect operations would be good from a
performance standpoint.

Wouldn't this be better tracked in the dependency tracking that's already
built into postgresql?

No, because dependencies are local to individual databases.

regards, tom lane

#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: scott.marlowe (#24)
Re: Tablespaces

"scott.marlowe" <scott.marlowe@ihs.com> writes:

Is possible / reasonable / smart and or dumb to look at implementing the
tablespaces as riding atop the initlocation handled stuff.

In my mind, one of the main benefits of this work will be that we'll be
able to get *rid* of the initlocation stuff. It's a crock.

regards, tom lane

#27scott.marlowe
scott.marlowe@ihs.com
In reply to: Tom Lane (#26)
Re: Tablespaces

On Fri, 27 Feb 2004, Tom Lane wrote:

"scott.marlowe" <scott.marlowe@ihs.com> writes:

Is possible / reasonable / smart and or dumb to look at implementing the
tablespaces as riding atop the initlocation handled stuff.

In my mind, one of the main benefits of this work will be that we'll be
able to get *rid* of the initlocation stuff. It's a crock.

OK, that's fine, but I keep thinking that a superuser should have to
create the tablespace itself, and then tables can be assigned by users
based on the rights assigned by the dba / superuser. Is that how we're
looking at doing it, or will any user be able to create a tablespace
anywhere postgresql has write permission, or will only dbas be able to
create AND use table spaces. I'm just not sure how that's gonna be
handled, and haven't seen it addressed.

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: scott.marlowe (#27)
Re: Tablespaces

"scott.marlowe" <scott.marlowe@ihs.com> writes:

On Fri, 27 Feb 2004, Tom Lane wrote:

In my mind, one of the main benefits of this work will be that we'll be
able to get *rid* of the initlocation stuff. It's a crock.

OK, that's fine, but I keep thinking that a superuser should have to
create the tablespace itself, and then tables can be assigned by users
based on the rights assigned by the dba / superuser.

Yeah, we haven't yet gotten to the issue of permissions, but certainly
creating or deleting a tablespace has to be a superuser-only operation,
if only because you probably have also got some manual filesystem work
to do to set up the associated directory; and that has to be done as
root or postgres.

It might be a good idea to restrict connect/disconnect (if we use those
operations) to superusers as well.

regards, tom lane

#29Noname
tswan@idigx.com
In reply to: Zeugswetter Andreas SB SD (#22)
Re: Tablespaces

I do not intend to undertake raw disk tablespaces for 7.5. I'd be
interested if anyone could provide some real world benchmarking of file
system vs. raw disk. Postgres benefits a lot from kernel file system
cache
at the moment.

Yes, and don't forget that pg also relys on the OS for grouping and
sorting the physical writes and doing readahead where appropriate.

Most people I know want tablespaces in order to limit or preallocate the
disk space used by a table or database in addition to controlling the
physical location of a table or database.

I know on linux, there is the option of creating an empty file or a
specific size using dd, mounting it through loopback, formatting it,
symlinking the appropriate OID/TID (or mounting the lpb device in the
appropriate directory) and then you control how much space that
directory/mount point can contain.

Of course, with MVCC you would have to vacuum frequently, as you could
miss some updates if there weren't enough tuples marked as free. If there
were "in-place" updates, the preallocation and limitation much easier, but
that's not how PG works.

If the tablespace disk space allocation is exceeded there would need to be
some graceful reporting condition back to the client. "UPDATE/INSERT
failed (tablespace size exceeded)", "(tablespace full)", "(disk full)" or
some other error may need to be handled/reported.

#30Gavin Sherry
swm@linuxworld.com.au
In reply to: Noname (#29)
Re: Tablespaces

On Fri, 27 Feb 2004 tswan@idigx.com wrote:

I do not intend to undertake raw disk tablespaces for 7.5. I'd be
interested if anyone could provide some real world benchmarking of file
system vs. raw disk. Postgres benefits a lot from kernel file system
cache
at the moment.

Yes, and don't forget that pg also relys on the OS for grouping and
sorting the physical writes and doing readahead where appropriate.

Most people I know want tablespaces in order to limit or preallocate the
disk space used by a table or database in addition to controlling the
physical location of a table or database.

I know on linux, there is the option of creating an empty file or a
specific size using dd, mounting it through loopback, formatting it,
symlinking the appropriate OID/TID (or mounting the lpb device in the
appropriate directory) and then you control how much space that
directory/mount point can contain.

Of course, with MVCC you would have to vacuum frequently, as you could
miss some updates if there weren't enough tuples marked as free. If there
were "in-place" updates, the preallocation and limitation much easier, but
that's not how PG works.

I do not intend to work on such a system for the initial introduction of
table spaces. The problem is, of course, knowing when you're actually out
of space in a table space in any given transaction. Given that WAL is on a
different partition (at least for the moment) the table space will not
have transaction X's data written to it until after transaction X is
finished. And we cannot error out a transaction which is already commited.

The solution is to keep track of free space and error out at some
percentage of free space remaining. But I don't want to complicate
tablespaces too much in 7.5.

Thanks,

Gavin

#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gavin Sherry (#30)
Re: Tablespaces

Gavin Sherry <swm@linuxworld.com.au> writes:

I do not intend to work on such a system for the initial introduction of
table spaces. The problem is, of course, knowing when you're actually out
of space in a table space in any given transaction.

It should not be that hard, at least not on local filesystems. When PG
realizes that a new page must be added to a table, it does a write() to
append a page of zeroes to the physical table. This happens
immediately. It's true that actual data may not be written into that
section of the file till long after commit, but the kernel should do
space allocation checking upon the first write.

I have heard tell that this may not happen when you are dealing with NFS
(yet another reason not to run databases across NFS) but on all local
filesystems I know of, out-of-space should result in a failure before
transaction commit.

I say "should" because I suspect this isn't a very heavily tested code
path in Postgres. But in theory it should work. Feel free to submit
bug reports if you find it doesn't.

regards, tom lane

#32Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#31)
Re: Tablespaces

Gavin,

#1:  I really think that we should have a way to set a "default tablespace"
for any database in a cluster.    This property would be vitally important
for anyone wishing to use tablespaces to impose quotas.   First, the
superuser would:
ALTER DATABASE db1 ALTER DEFAULT_TABLESPACE partition2;
then any regular users creating tables in that database would, by default,
have TABLESPACE partition2 automatically appended to them by the parser
unless overridden in the creation statement by specifying another, specific,
tablespace.

Alternately, the default tablespace could be set through a GUC.   In my mind,
this would be inferior on 2 counts:
1) It would require adding Yet Another Miscellaneos GUC Variable.
2) It would preclude large, multisuer installations from seamlessly using
tablespaces for quotas, becuase there would be no way to transparently set
the GUC differently for each user or database.

#2: Permissions:
I see the permissions issue as quite transparent.   First, I agree that only
the superuser should have the right to create, alter, or drop tablespaces.
'nuff said.
Second, as far as I can see, there is only one relevant permission for regular
users:  USE.   Either the user is permitted to create objects in that
tablespace, or he/she is not.  Other permissions, such as read access, should
NOT be set by tablespace, as such permissions are already governed by
database, table, and schema; to add a SELECT restriction to tablespaces would
frequently result in paralytic snarls of conflicting permissions on complex
installations.
Thus, by my proposal, the only GRANT for tablespaces (executed by a superuser)
would be:
GRANT USE ON tablespace1 TO user;
This permission would ONLY be accessed for CREATE/ALTER TABLE, and CREATE
INDEX statements.
Easy, neh?

#3: ALTER TABLE .... CHANGE TABLESPACE:
This is strictly in the class of "would be a very nice & useful feature if
it's not too difficult".  

Given how painful it is to drop & replace a table with multiple dependencies
(on some databases, only possible by droping & re-loading the entire
database) it would be nice to have an ALTER TABLE command that moved the
table to another tablespace.    It doesn't *seem* to me that this would be a
very challenging bit of programming, as the operation would be very similar
to REINDEX in the manipulation of files.   (But what I know, really?)

Once tablespaces are a feature and some users start using them for quota
management, there will quickly develop situations where the original
tablespace for a db runs out of room and can't be resized.   Being able to
move the table "in situ" then becomes vital, especially on very large
databases ... and when someday combined with partitioned tables, will become
essential.

Further, we will get an *immediate* flurry of requests from users who just
upgraded to 7.5 and want to make use of the tablespaces feature on an
existing production database.

ALTER INDEX ... CHANGE TABLESPACE is *not* needed, though, as there are no
issues other than time which I know of with dropping & re-creating an index.

If ALTER TABLE CHANGE TABLESPACE has some major technical hurdles, then I
think it's one of those things that could be put off until the next version
of tablespaces, or even held until Partition Tables is developed for a
combined solution.    But it would be nice to have.

--
Josh Berkus
Aglio Database Solutions
San Francisco

#33Andrew Sullivan
ajs@crankycanuck.ca
In reply to: Alex J. Avriette (#18)
Re: Tablespaces

On Thu, Feb 26, 2004 at 05:28:41PM -0500, Alex J. Avriette wrote:

The only reason I mentioned it to begin with was the recommendation of
directio for databases in the Sun Blueprint, _Tuning Databases on the
Solaris Platform_ (and being a Solaris geek, I asked, but apparently
nobody else is worried enough about performance or not using Solaris
enough to care).

That recommendation itself is a few years old. While it may still be
true that directio is still fastest for Oracle on Solaris, I'd sure
like to see some recent evidence. I've a funny feeling that this is
an old rule of thumb which is now true in the sense that everyone
believes it, but maybe not in the sense that a test would reveal it
to be a sensible rule.

like to see some of the features of Oracle and DB2 available to users
of postgres. Some of these features are raw disk, tablespaces, and
replication. We're getting there, and making really terrific progress

I don't think we want features for their own sake, though, and I'm
not convinced that raw filesystems are actually useful. Course, it's
not my itch, and PostgreSQL _is_ free software.

A

--
Andrew Sullivan

#34Simon Riggs
simon@2ndquadrant.com
In reply to: Gavin Sherry (#30)
Re: Tablespaces

Gavin Sherry
On Fri, 27 Feb 2004 tswan@idigx.com wrote:

Most people I know want tablespaces in order to limit or preallocate

the

disk space used by a table or database in addition to controlling

the

physical location of a table or database.

I do not intend to work on such a system for the initial introduction

of

table spaces. The problem is, of course, knowing when you're actually

out

of space in a table space in any given transaction. Given that WAL is

on a

different partition (at least for the moment) the table space will not
have transaction X's data written to it until after transaction X is
finished. And we cannot error out a transaction which is already

commited.

The solution is to keep track of free space and error out at some
percentage of free space remaining. But I don't want to complicate
tablespaces too much in 7.5.

You're absolutely right about the not-knowing when you're out of space
issue. However, if the xlog has been written then it is not desirable,
but at least acceptable that the checkpoint/bgwriter cannot complete on
an already committed txn. It's not the txn which is getting the error,
that's all.

Hmmm...I'm not sure that we'll be able or should avoid the out of space
situation completely. The question is...what will we do when we hit it?
It doesn't matter whether you stop at 100% or 90% or whatever, you still
have to stop and then what? Stay up as long as possible hopefully: If
there wasn't enough space to write to the tablespace, going into
recovery won't help the situation either; youre still out of space until
you fix that. We now have the option not to crash, since it might be
perfectly viable to keep on chugging away on one Tablespace even though
all txn work on the out-of-space tablespace is frozen/barred etc. Sounds
like a refinement, but something to keep in mind at the design stage if
we can.

The problem is that tablespaces do complicate space management (that's
what people want though, so that's OK). That complicates admin and so pg
will hit many more out of space errors than we've seen previously.
Trying to work out how to spot these ahead of time, accept user defined
limits on each tablespace etc sounds like extra complexity for the
initial drop. I guess my own suggested approach is to start by handling
the error cases, then go back and try to avoid some of them.

All of this exposes for me the complication that doing PITR and
tablespaces at the same time is likely to be more complex for us both
than either had envisaged. The reduced complexity for PITR was what I
was shooting for, also! I'm happy to work together on any issues that
arise.

For PITR, I think we would need:
- a very accessible list of tablespace locations, so taking a full
physical database backup can be easily accomplished using OS utilities.
Hopefully a list maintained external to the database? We have the
equivalent now with env variables.
- decisions about what occurs when for-whatever-reason one or more
tablespaces are not recoverable from backup?
- it might be desirable to allow recovery with less than all of the
original tablespces
- it might also be desirable to allow recovery when the tablespaces txn
Ids don't match (though that is forbidden on many other dbms)

Best Regards, Simon Riggs

#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#34)
Re: Tablespaces

"Simon Riggs" <simon@2ndquadrant.com> writes:

Gavin Sherry wrote:

I do not intend to work on such a system for the initial introduction of
table spaces. The problem is, of course, knowing when you're actually out
of space in a table space in any given transaction. Given that WAL is on a
different partition (at least for the moment) the table space will not
have transaction X's data written to it until after transaction X is
finished. And we cannot error out a transaction which is already
commited.

As long as the kernel doesn't lie about file extension, we will not
commit any transaction that requires a disallowed increase in the
allocated size of data files, because allocation of another table page
is checked with the kernel during the transaction. So on most
filesystems (maybe not NFS) the problem Gavin is worried about doesn't
exist.

You're absolutely right about the not-knowing when you're out of space
issue. However, if the xlog has been written then it is not desirable,
but at least acceptable that the checkpoint/bgwriter cannot complete on
an already committed txn. It's not the txn which is getting the error,
that's all.

Right. This is in fact not a fatal situation, as long as you don't run
out of preallocated WAL space. For a recent practical example of our
behavior under zero-free-space conditions, see this thread:
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00530.php
particularly the post-mortem here:
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00606.php
Barring one small bug, the database would likely have stayed up, and
continued to service at least the read-only transactions, until Chris
got around to freeing some disk space.

I think it is sufficient (at least in the near term) to expect people to
use partition size limits if they want to control database size --- that
is, make a partition of the desired size and put the database directory
in there. Tablespaces as per the design we are discussing would make it
easier to apply such a policy to a sub-area of a database cluster than
it is today, but they needn't in themselves implement the restriction.

regards, tom lane

#36Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#35)
Re: Out of space situation and WAL log pre-allocation (was Tablespaces)

Tom Lane [mailto:tgl@sss.pgh.pa.us]

"Simon Riggs" <simon@2ndquadrant.com> writes:

You're absolutely right about the not-knowing when you're out of

space

issue. However, if the xlog has been written then it is not

desirable,

but at least acceptable that the checkpoint/bgwriter cannot complete

on

an already committed txn. It's not the txn which is getting the

error,

that's all.

Right. This is in fact not a fatal situation, as long as you don't

run

out of preallocated WAL space.

...following on also from thoughts on [PERFORM] list...

Clearly running out of pre-allocated WAL space is likely to be the next
issue. Running out of space in the first place is likely to be because
of an intense workload, which is exactly the thing which also makes you
run out of pre-allocated WAL space. Does that make sense?

Best regards, Simon Riggs

#37Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Gavin Sherry (#1)
Re: Tablespaces

A table space parameter will be added to DDL commands which create
physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to
CREATE SCHEMA. The associated routines, as well as the corresponding DROP
commands will need to be updated. Adding the ability to ALTER <object>
TABLESPACE <name> seems a little painful. Would people use it? Comments?

How about allowing the specification on schemas and databases of
different default tablespaces for TEMP, TABLE and INDEX?? Is there any
point to that?

Chris

#38Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Gavin Sherry (#1)
Re: Tablespaces

I've been looking at implementing table spaces for 7.5. Some notes and
implementation details follow.

Ah sorry, other things you might need to consider:

Privileges on tablespaces:

GRANT USAGE ON TABLESPACE tbsp TO ...;

Different disk settings for different tablespaces (since they will
likely be on different disks):

ALTER TABLESPACE tbsp SET random_page_cost TO 2.5;

Chris

#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christopher Kings-Lynne (#37)
Re: Tablespaces

Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:

How about allowing the specification on schemas and databases of
different default tablespaces for TEMP, TABLE and INDEX?? Is there any
point to that?

TEMP tables are not local to any particular schema, so it wouldn't make
sense to have a schema-level default for their placement.

The other five combinations are at least theoretically sensible, but
do we need 'em all? It seems to me that a reasonable compromise is to
offer database-level default tablespaces for TEMP, TABLE, and INDEX,
ignoring the schema level. This is simple and understandable, and if
you don't like it, you're probably the kind of guy who will want to
override it per-table anyway ...

BTW, another dimension to think about is where TOAST tables and their
indexes will get placed.

regards, tom lane

#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#36)
Re: Out of space situation and WAL log pre-allocation (was Tablespaces)

"Simon Riggs" <simon@2ndquadrant.com> writes:

Tom Lane wrote:

Right. This is in fact not a fatal situation, as long as you don't
run out of preallocated WAL space.

Clearly running out of pre-allocated WAL space is likely to be the next
issue. Running out of space in the first place is likely to be because
of an intense workload, which is exactly the thing which also makes you
run out of pre-allocated WAL space. Does that make sense?

I think one of the first things people would do with tablespaces is
stick the data files onto a separate partition from the WAL and clog
files. (Actually you can do this today with a simple symlink hack, but
tablespaces will make it easier and clearer.) The space usage for WAL
is really pretty predictable, because of the checkpoint-at-least-
every-N-segments setting. clog is not exactly a space hog either.
Once you have that separation established, out-of-disk-space can kill
individual transactions but never the database as a whole.

One of the things that bothers me about the present PITR design is that
it presumes that individual WAL log segments can be kept until the
external archiver process feels like writing them somewhere. If there's
no guarantee that that happens within X amount of time, then you can't
bound the amount of space needed on the WAL drive, and so you are back
facing the possibility of an out-of-WAL-space panic. I suspect that we
cannot really do anything about that, but it's annoying. Any bright
ideas out there?

regards, tom lane

#41Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Gavin Sherry (#1)
Re: Tablespaces

Gavin Sherry wrote:

The actual creation of the table space will be done with:

CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;

Seems you should use CREATE TABLESPACE (no space) so it is more distinct
from CREATE TABLE.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#42Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Richard Huxton (#3)
Re: Tablespaces

Richard Huxton wrote:

On Thursday 26 February 2004 10:07, Gavin Sherry wrote:

CREATE TABLESPACE tbl1 LOCATION '/var/'

which will result in something like '/var/123443' is a bad idea. Then
again, the user should know better. Comments?

The LOCATION should have the same owner and permissions as $PGDATA - that
should catch mistyping.

Unless you're running as root, of course. In which case you clearly know
better than everyone else, so off you go!

FYI, you can't run the postmaster as root.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#43Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Joshua D. Drake (#8)
Re: Tablespaces

Joshua D. Drake wrote:

Is it possible to put WALs and CLOGs into different tablespaces? (maybe
different RAID systems). Some companies want that ...

You can do this now, but it would be nice to be able to have it actually
configurable versus the hacked linked method.

Agreed, but because the system has to be down to move pg_xlog, I think
we should write a command-line utility to assist with this, perhaps. It
could check permissions and stuff.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#44Gavin Sherry
swm@linuxworld.com.au
In reply to: Bruce Momjian (#41)
Re: Tablespaces

On Tue, 2 Mar 2004, Bruce Momjian wrote:

Gavin Sherry wrote:

The actual creation of the table space will be done with:

CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;

Seems you should use CREATE TABLESPACE (no space) so it is more distinct
from CREATE TABLE.

Oops. Typo.

Show quoted text
--
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 359-1001
+  If your life is a hard drive,     |  13 Roberts Road
+  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

!DSPAM:40455de0297537578347468!

#45Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#11)
Re: Tablespaces

Tom Lane wrote:

Gavin Sherry <swm@linuxworld.com.au> writes:

A table space is a directory structure. The directory structure is as
follows:
[swm@dev /path/to/tblspc]$ ls
OID1/ OID2/
OID1 and OID2 are the OIDs of databases which have created a table space
against this file system location. In this respect, a table space
resembles $PGDATA/base. I thought it useful to keep this kind of
namespace mechanism in place ...

Actually, this is *necessary* AFAICT. The case that forces it is DROP
DATABASE. Since you have to execute that from another database, there's
no reasonable way to look into the target database's catalogs. That
means that the OID of the database has to be sufficient information to
get rid of all its files. You can do this fairly easily if in each
tablespace (whose locations you know from the shared pg_tablespace
table) you can look for a subdirectory matching the target database's
OID. If we tried to put the database's files just "loose" in each
tablespace directory then we'd be in trouble.

Gavin, let us know if you want us to create the global pg_tablespace for
you. Some of us have done a lot of system catalog work.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#46Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#17)
Re: Tablespaces

Tom Lane wrote:

Gavin Sherry <swm@linuxworld.com.au> writes:

Speaking of locking, can we do anything to prevent people from shooting
themselves in the foot by changing active tablespaces? Are we even
going to have a DROP TABLESPACE command, and if so what would it do?

Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
drop a table space until the directory is empty.

How would it get to be empty? Are you thinking of some sort of "connect
database to tablespace" and "disconnect database from tablespace"
commands that would respectively create and delete the per-database
subdirectory? That seems moderately reasonable to me. We could then
invent a locking protocol that requires backends to lock a tablespace
before they can execute either of these operations (or delete the
tablespace of course).

One crude solution would be to remove the tablespace oid directory only
when the database is dropped, and require an empty tablespace directory
to drop the tablespace. This allows a lock only on tablespace creation,
and not a lock on object creation in each tablespace.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#47Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#40)
Re: Out of space situation and WAL log pre-allocation (was

Tom Lane wrote:

One of the things that bothers me about the present PITR design is that
it presumes that individual WAL log segments can be kept until the
external archiver process feels like writing them somewhere. If there's
no guarantee that that happens within X amount of time, then you can't
bound the amount of space needed on the WAL drive, and so you are back
facing the possibility of an out-of-WAL-space panic. I suspect that we
cannot really do anything about that, but it's annoying. Any bright
ideas out there?

Maybe specify an archive location (that of course could be on a separate
partition) that the external archiver should check in addition to the
normal WAL location. At some predetermined interval, push WAL log
segments no longer needed to the archive location.

Joe

#48Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Greg Stark (#20)
Re: [HACKERS] Tablespaces

Greg Stark wrote:

I am expecting to hear some bleating about this from people whose
preferred platforms don't support symlinks ;-). However, if we don't

Well, one option would be to have the low level filesystem storage (md.c?)
routines implement a kind of symlink themselves. Just a file with a special
magic number followed by a path.

I'm normally against reimplementing OS services but symlinks are really a very
simple concept and simple to implement. Especially if you can make a few
simplifying assumptions: they only ever need to appear as the final path
element not as parent directories and tablespaces don't need symlinks pointing
to symlinks. Ideally postgres also doesn't need to implement relative links
either.

I just checked from the MinGW console and I see:

# touch a
# ln -s a b
# echo test >a
# cat b
# l ?
-rw-r--r-- 1 Bruce Mo Administ 5 Mar 2 23:30 a
-rw-r--r-- 1 Bruce Mo Administ 0 Mar 2 23:30 b
# cat a
test
# cat b
#

It accepts ln -s, but does nothing with it.

For tablespaces on OS's that don't support it, I think we will have to
store the path name in the file and read it via the backend. Somehow we
should cache those lookups.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#49Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#26)
Re: Tablespaces

Tom Lane wrote:

"scott.marlowe" <scott.marlowe@ihs.com> writes:

Is possible / reasonable / smart and or dumb to look at implementing the
tablespaces as riding atop the initlocation handled stuff.

In my mind, one of the main benefits of this work will be that we'll be
able to get *rid* of the initlocation stuff. It's a crock.

Agreed. It should be ripped out once we have tablespaces, and if we keep
it for one extra release, there will be confusion over which to use.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#50Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Josh Berkus (#32)
Re: Tablespaces

Josh Berkus wrote:

#3: ALTER TABLE .... CHANGE TABLESPACE:
This is strictly in the class of "would be a very nice & useful feature if
it's not too difficult". ?

Given how painful it is to drop & replace a table with multiple dependencies
(on some databases, only possible by droping & re-loading the entire
database) it would be nice to have an ALTER TABLE command that moved the
table to another tablespace. ? ?It doesn't *seem* to me that this would be a
very challenging bit of programming, as the operation would be very similar
to REINDEX in the manipulation of files. ? (But what I know, really?)

Once tablespaces are a feature and some users start using them for quota
management, there will quickly develop situations where the original
tablespace for a db runs out of room and can't be resized. ? Being able to
move the table "in situ" then becomes vital, especially on very large
databases ... and when someday combined with partitioned tables, will become
essential.

Further, we will get an *immediate* flurry of requests from users who just
upgraded to 7.5 and want to make use of the tablespaces feature on an
existing production database.

If we don't implement moving tables between tablespaces, we should add a
stub for it in the grammer and mention it is not implemented yet,
because if we don't, we will get tons of questions.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#51Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Gavin Sherry (#14)
Re: Tablespaces

Gavin Sherry wrote:

Actually, I think that's a pretty good idea :-). I'd solves a bunch of
issues in the backend (postmaster start up can recurse through
$PGDATA/tablespaces looking for postmaster.pid files) and will also assist
admins with complex configurations (perhaps).

Why are you asking about postmaster.pid files. That file goes in the
top level /data directory, no?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#52Gavin Sherry
swm@linuxworld.com.au
In reply to: Bruce Momjian (#51)
Re: Tablespaces

On Tue, 2 Mar 2004, Bruce Momjian wrote:

Gavin Sherry wrote:

Actually, I think that's a pretty good idea :-). I'd solves a bunch of
issues in the backend (postmaster start up can recurse through
$PGDATA/tablespaces looking for postmaster.pid files) and will also assist
admins with complex configurations (perhaps).

Why are you asking about postmaster.pid files. That file goes in the
top level /data directory, no?

I was trying to be paranoid about users who have multiple postmasters on
the same machine and want to share a table space while both systems are
live. There'd be no mechanism to test for that situation if we didn't have
something like a postmaster.pid file. Is this being a little too paranoid?

Gavin

#53Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joe Conway (#47)
Re: Out of space situation and WAL log pre-allocation (was Tablespaces)

Joe Conway <mail@joeconway.com> writes:

Tom Lane wrote:

facing the possibility of an out-of-WAL-space panic. I suspect that we
cannot really do anything about that, but it's annoying. Any bright
ideas out there?

Maybe specify an archive location (that of course could be on a separate
partition) that the external archiver should check in addition to the
normal WAL location. At some predetermined interval, push WAL log
segments no longer needed to the archive location.

Does that really help? The panic happens when you fill the "normal" and
"archive" partitions, how's that different from one partition?

regards, tom lane

#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gavin Sherry (#52)
Re: Tablespaces

Gavin Sherry <swm@linuxworld.com.au> writes:

I was trying to be paranoid about users who have multiple postmasters on
the same machine and want to share a table space while both systems are
live. There'd be no mechanism to test for that situation if we didn't have
something like a postmaster.pid file. Is this being a little too paranoid?

Hm. AFAICS there is no safe situation in which a tablespace directory
could be shared by two different installations (== toplevel $PGDATA
directories). I don't think we need a dynamic postmaster.pid-type lock
to protect them. What might make sense is some sort of marker file in a
tablespace directory that links back to the owning $PGDATA directory.
CREATE TABLESPACE should create this, or reject if it already exists.

regards, tom lane

#55Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Gavin Sherry (#52)
Re: Tablespaces

Gavin Sherry wrote:

On Tue, 2 Mar 2004, Bruce Momjian wrote:

Gavin Sherry wrote:

Actually, I think that's a pretty good idea :-). I'd solves a bunch of
issues in the backend (postmaster start up can recurse through
$PGDATA/tablespaces looking for postmaster.pid files) and will also assist
admins with complex configurations (perhaps).

Why are you asking about postmaster.pid files. That file goes in the
top level /data directory, no?

I was trying to be paranoid about users who have multiple postmasters on
the same machine and want to share a table space while both systems are
live. There'd be no mechanism to test for that situation if we didn't have
something like a postmaster.pid file. Is this being a little too paranoid?

Oh, yikes, I see. Right now we have the interlock on the /data
directory, but once you start moving stuff out from under /data using
tablespaces, we do perhaps loose the interlock. However, I assume the
CREATE TABLESPACE is going to create the tablespace directory, so I
don't see how two postmasters could both create the directory.

For example, if you say

CREATE TABLESPACE tb IN '/var/tb1'

I assume you have to create:

/var/tb1/pgsql_tablespace

and then

/var/tb1/pgsql_tablespace/oid1
/var/tb1/pgsql_tablespace/oid2

or something like that, and set the proper permissions on
pgsql_tablespace. We will have write permission on the directory they
pass to us, but we might not have permissions to change the mode of the
directory they pass, so we have to create a subdirectory anyway, and
that is our interlock.

For example:

# run as root
$ chmod a+w .
$ ls -ld .
drwxrwxrwx 2 root wheel 512 Mar 2 23:51 .

# run as the postmaster
$ mkdir new
$ ls -ld new
drwxr-xr-x 2 postgres wheel 512 Mar 2 23:52 new
$ chmod 700 new
$ chmod 700 .
chmod: .: Operation not permitted
chmod: .: Operation not permitted

As you can see, I have permission to create the /new directory, but no
ability to set its mode, so we have to create a directory that matches
the permissions of /data:

drwx------ 6 postgres postgres 512 Mar 2 12:48 /u/pg/data/

We could require the admin to create a directory that we own instead of
just one that we have write permission in, but why bother when we can
use the new directory as an interlock from multiple postmasters anyway.

Right now we do require the directory used as /data be one where we can
create a /data subdirectory, so this seems similar. We don't put the
data directly in the passed directory, but in /data under that. In
fact, we could just call it /var/tb1/data instead of
/var/tb1/pgsql_tablespace.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#56Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#53)
Re: Out of space situation and WAL log pre-allocation (was

Tom Lane wrote:

Joe Conway <mail@joeconway.com> writes:

Maybe specify an archive location (that of course could be on a separate
partition) that the external archiver should check in addition to the
normal WAL location. At some predetermined interval, push WAL log
segments no longer needed to the archive location.

Does that really help? The panic happens when you fill the "normal" and
"archive" partitions, how's that different from one partition?

I see your point. But it would allow you to use a relatively modest
local partition for WAL segments, while you might be using a 1TB netapp
tray over NFS for the archive segments. I guess if the archive partition
fills up, I would err on the side of dropping archive segments on the
floor. That would mean a new full backup would be needed, but at least
it wouldn't result in a corrupt, or shut down, database.

Joe

#57Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#48)
Re: [HACKERS] Tablespaces

Bruce Momjian <pgman@candle.pha.pa.us> writes:

For tablespaces on OS's that don't support it, I think we will have to
store the path name in the file and read it via the backend. Somehow we
should cache those lookups.

My feeling is that we need not support tablespaces on OS's without
symlinks.

regards, tom lane

#58Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#57)
Re: [HACKERS] Tablespaces

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

For tablespaces on OS's that don't support it, I think we will have to
store the path name in the file and read it via the backend. Somehow we
should cache those lookups.

My feeling is that we need not support tablespaces on OS's without
symlinks.

Agreed, but are we going to support non-tablespace installs? I wasn't
sure that was an option.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#59Gavin Sherry
swm@linuxworld.com.au
In reply to: Tom Lane (#57)
Re: [HACKERS] Tablespaces

On Wed, 3 Mar 2004, Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

For tablespaces on OS's that don't support it, I think we will have to
store the path name in the file and read it via the backend. Somehow we
should cache those lookups.

My feeling is that we need not support tablespaces on OS's without
symlinks.

I'm going to focus on implementing this on the system(s) I'm used to
developing on (ie, those which support symlinks). Once that is done, I'll
talk with the Win32 guys about what, if anything, we can do about getting
this to work on Win32 (and possibly other non-symlink supporting OSs).

Thanks,

Gavin

#60Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joe Conway (#56)
Re: Out of space situation and WAL log pre-allocation (was Tablespaces)

Joe Conway <mail@joeconway.com> writes:

Tom Lane wrote:

Joe Conway <mail@joeconway.com> writes:

Maybe specify an archive location (that of course could be on a separate
partition) that the external archiver should check in addition to the
normal WAL location. At some predetermined interval, push WAL log
segments no longer needed to the archive location.

Does that really help? The panic happens when you fill the "normal" and
"archive" partitions, how's that different from one partition?

I see your point. But it would allow you to use a relatively modest
local partition for WAL segments, while you might be using a 1TB netapp
tray over NFS for the archive segments.

Fair enough, but it seems to me that that sort of setup really falls in
the category of a user-defined archiving process --- that is, the hook
that Postgres calls will push WAL segments from the local partition to
the NFS server, and then pushing them off NFS to tape is the
responsibility of some other user-defined subprocess. Database panic
happens if and only if the local partition overflows. I don't see that
making Postgres explicitly aware of the secondary NFS arrangement will
buy anything.

I guess if the archive partition fills up, I would err on the side of
dropping archive segments on the floor.

That should be user-scriptable policy, in my worldview.

We haven't yet talked much about what the WAL-segment-archiving API
should look like, but if it cannot support implementing the above kind
of arrangement outside the database, then we've dropped the ball.
IMHO anyway.

regards, tom lane

#61Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#58)
Re: [HACKERS] Tablespaces

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

My feeling is that we need not support tablespaces on OS's without
symlinks.

Agreed, but are we going to support non-tablespace installs? I wasn't
sure that was an option.

A setup containing only the default tablespace cannot use any symlinks.
That doesn't seem hard though.

regards, tom lane

#62Claudio Natoli
claudio.natoli@memetrics.com
In reply to: Tom Lane (#61)
Re: [HACKERS] Tablespaces

Gavin Sherry wrote:

I'm going to focus on implementing this on the system(s) I'm used to
developing on (ie, those which support symlinks). Once that is done, I'll
talk with the Win32 guys about what, if anything, we can do about getting
this to work on Win32 (and possibly other non-symlink supporting OSs).

If the design can allow the default tablespace to not require symlinks, then
I think we're in good shape, as we can simply choose to introduce
tablespaces under Win32 whenever it suits.

Cheers,
Claudio

--- 
Certain disclaimers and policies apply to all email sent from Memetrics.
For the full text of these disclaimers and policies see 
<a
href="http://www.memetrics.com/emailpolicy.html">http://www.memetrics.com/em
ailpolicy.html</a>
#63Claudio Natoli
claudio.natoli@memetrics.com
In reply to: Claudio Natoli (#62)
Re: [HACKERS] Tablespaces

Bruce Momjian writes:

I just checked from the MinGW console and I see:
[snip]
It accepts ln -s, but does nothing with it.

And even if it had worked, it wouldn't really matter, since we don't
actually want to *run* the system under MinGW/msys, just build it.

I think the idea of implementing in symlinks for non-compliant platforms in
md.c has some merit. FWIW, looks like that is how cygwin implements
symlinks...

Cheers,
Claudio

--- 
Certain disclaimers and policies apply to all email sent from Memetrics.
For the full text of these disclaimers and policies see 
<a
href="http://www.memetrics.com/emailpolicy.html">http://www.memetrics.com/em
ailpolicy.html</a>
#64Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Claudio Natoli (#63)
Re: [HACKERS] Tablespaces

For tablespaces on OS's that don't support it, I think we will have to
store the path name in the file and read it via the backend. Somehow we
should cache those lookups.

My feeling is that we need not support tablespaces on OS's without
symlinks.

To create symlinked directories on Win2k NTFS see:
http://www.sysinternals.com/ntw2k/source/misc.shtml#junction

I think Win2000 or XP would be a reasonable restriction for Win32 PG
installations that want tablespaces.

Andreas

#65Oliver Elphick
olly@lfix.co.uk
In reply to: Tom Lane (#54)
Re: Tablespaces

On Wed, 2004-03-03 at 04:59, Tom Lane wrote:

What might make sense is some sort of marker file in a
tablespace directory that links back to the owning $PGDATA directory.
CREATE TABLESPACE should create this, or reject if it already exists.

It will not be enough for the marker to list the path of the parent
$PGDATA, since that path might get changed by system administration
action. The marker should contain some sort of unique string which
would match the same string somewhere in $PGDATA. Then, if either
tablespace or $PGDATA were moved, it would be possible to tie the two
back together. It wouldn't be an issue on most normal systems, but
might be of crucial importance for an ISP running numerous separate
clusters.
--
Oliver Elphick <olly@lfix.co.uk>
LFIX Ltd

#66Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#61)
Re: [HACKERS] Tablespaces

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

My feeling is that we need not support tablespaces on OS's without
symlinks.

Agreed, but are we going to support non-tablespace installs? I wasn't
sure that was an option.

A setup containing only the default tablespace cannot use any symlinks.
That doesn't seem hard though.

Yea, I think you are right. We just disable CREATE TABLESPACE and the
rest should just work. Basically, pg_tablespace will only have one
entry on those platforms. The initdb directory structure will have a
single tablespace, but that doesn't use symlinks.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#67Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Claudio Natoli (#63)
Re: [HACKERS] Tablespaces

Claudio Natoli wrote:

Bruce Momjian writes:

I just checked from the MinGW console and I see:
[snip]
It accepts ln -s, but does nothing with it.

And even if it had worked, it wouldn't really matter, since we don't
actually want to *run* the system under MinGW/msys, just build it.

I think the idea of implementing in symlinks for non-compliant platforms in
md.c has some merit. FWIW, looks like that is how cygwin implements
symlinks...

Why can't we use MS Win32 shortcut files to simulate symlinks? MinGW
doesn't do it, so I suppose it isn't possible.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#68Magnus Hagander
mha@sollentuna.net
In reply to: Bruce Momjian (#67)
Re: [HACKERS] Tablespaces

I just checked from the MinGW console and I see:
[snip]
It accepts ln -s, but does nothing with it.

And even if it had worked, it wouldn't really matter, since

we don't

actually want to *run* the system under MinGW/msys, just build it.

I think the idea of implementing in symlinks for non-compliant
platforms in md.c has some merit. FWIW, looks like that is

how cygwin

implements symlinks...

Why can't we use MS Win32 shortcut files to simulate
symlinks? MinGW doesn't do it, so I suppose it isn't possible.

Shortcut files are *only* a GUI feature. If you do a "dir" in the
filesystem they show up as a file. If you issue open() or similar on
them (for example, by doing "type" on the command prompt), you will get
the .lnk file, not the file it points to.

//Magnus

#69Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB SD (#64)
Re: [HACKERS] Tablespaces

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

My feeling is that we need not support tablespaces on OS's without
symlinks.

To create symlinked directories on Win2k NTFS see:
http://www.sysinternals.com/ntw2k/source/misc.shtml#junction
I think Win2000 or XP would be a reasonable restriction for Win32 PG
installations that want tablespaces.

Oh, good --- symlinks for directories are all that we need for this
design. I think that settles it then.

regards, tom lane

#70Tom Lane
tgl@sss.pgh.pa.us
In reply to: Oliver Elphick (#65)
Re: Tablespaces

Oliver Elphick <olly@lfix.co.uk> writes:

It will not be enough for the marker to list the path of the parent
$PGDATA, since that path might get changed by system administration
action. The marker should contain some sort of unique string which
would match the same string somewhere in $PGDATA.

We have already added a notion of a "unique installation identifier"
for PITR purposes (look in pg_control). So we could use that for this
purpose if we wanted to.

But I'm not sure how important it really is. AFAICS the behavior of
CREATE TABLESPACE will be "create marker file, if it already exists
then abort". It has no need to actually look in the file and so there's
no need for the contents to be unique.

regards, tom lane

#71Andrew Dunstan
andrew@dunslane.net
In reply to: Zeugswetter Andreas SB SD (#64)
Re: [HACKERS] Tablespaces

Zeugswetter Andreas SB SD wrote:

To create symlinked directories on Win2k NTFS see:
http://www.sysinternals.com/ntw2k/source/misc.shtml#junction

I don't think we could use this s/w though, unless the author is
prepared to relicense it. I'm sure implementing a clean room version of
the relevant parts wouldn't be too hard, though.

cheers

andrew

#72Magnus Hagander
mha@sollentuna.net
In reply to: Andrew Dunstan (#71)
Re: [HACKERS] Tablespaces

To create symlinked directories on Win2k NTFS see:
http://www.sysinternals.com/ntw2k/source/misc.shtml#junction

I don't think we could use this s/w though, unless the author is
prepared to relicense it. I'm sure implementing a clean room
version of
the relevant parts wouldn't be too hard, though.

Definitly not (ooh, danger...) - it's a simple and well documented call
to DeviceIOControl().
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/
base/fsctl_set_reparse_point.asp
is the one you want, I think.

We can always reference to that utility as a good way to get information
about the junctions pgsql has created...

//Magnus

#73Greg Stark
gsstark@mit.edu
In reply to: Bruce Momjian (#48)
Re: Tablespaces

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Greg Stark wrote:

I am expecting to hear some bleating about this from people whose
preferred platforms don't support symlinks ;-). However, if we don't

Well, one option would be to have the low level filesystem storage (md.c?)
routines implement a kind of symlink themselves. Just a file with a special
magic number followed by a path.

On further contemplation it doesn't seem like using symlinks really ought to
be necessary. It should be possible to drive everything off the catalog tables
while avoidin having the low level filesystem code know anything about them.

Instead of having the low level code fetch the pg_* records themselves, some
piece of higher level code would do the query and call down to storage layer
to inform it of the locations for everything. It would have to do this on
database initialization and on any subsequent object creation.

Basically maintain an in-memory hash table of oid -> path, and call down to
the low level code whenever that hash changes. (Or more likely oid->ts_id and
a separate list of ts_id -> path.)

--
greg

#74Barry Lind
blind@xythos.com
In reply to: Oliver Elphick (#65)
Re: Tablespaces

Oliver Elphick wrote:

On Wed, 2004-03-03 at 04:59, Tom Lane wrote:

What might make sense is some sort of marker file in a
tablespace directory that links back to the owning $PGDATA directory.
CREATE TABLESPACE should create this, or reject if it already exists.

It will not be enough for the marker to list the path of the parent
$PGDATA, since that path might get changed by system administration
action. The marker should contain some sort of unique string which
would match the same string somewhere in $PGDATA. Then, if either
tablespace or $PGDATA were moved, it would be possible to tie the two
back together. It wouldn't be an issue on most normal systems, but
might be of crucial importance for an ISP running numerous separate
clusters.

Taking this one step further would be to do something like Oracle does.
Every datafile in Oracle (because the Oracle storage manager stores
multiple objects inside datafiles, one could say there is some
similarity between Oracle datafiles and the proposed pg tablespaces),
has meta info that tells it which database instance it belongs to and
the last checkpoint that occured (It might actually be more granular
than checkpoint, such that on a clean shutdown you can tell that all
datafiles are consistent with each other and form a consistent database
instance). So Oracle on every checkpoint updates all datafiles with an
identifier. Now you might ask why is this useful. Well in normal day
to day operation it isn't, but it can be usefull in disaster recovery.
If you loose a disk and need to restore the entire database from backups
it can be difficult to make sure you have done it all correctly (do I
have all the necessary files/directories? did I get the right ones from
the right tapes?) Especially if you have directories spread across
various different disks that might be backed up to different tapes. So
by having additional information stored in each datafile Oracle can
provide additional checks that the set of files that are being used when
the database starts up are consistent and all belong together. Oracle
also ensures that all the datafiles that are suposed to exist actually
do as well.

So what might this mean for postgres and tablespaces? It could mean
that on startup the database checks to verify that all the tablespaces
that are registered actually exist. And that the data in each
tablespace is consistent with the current WAL status. (i.e. someone
didn't restore a tablespace from backup while the database was down that
is old and needs recovery.

A lot of what I am talking about here become PITR issues. But since
PITR and tablespaces are both potential features for 7.5, how they
interact probably should be thought about in the designs for each.

thanks,
--Barry

#75Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Greg Stark (#73)
Re: Tablespaces

Greg Stark wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Greg Stark wrote:

I am expecting to hear some bleating about this from people whose
preferred platforms don't support symlinks ;-). However, if we don't

Well, one option would be to have the low level filesystem storage (md.c?)
routines implement a kind of symlink themselves. Just a file with a special
magic number followed by a path.

On further contemplation it doesn't seem like using symlinks really ought to
be necessary. It should be possible to drive everything off the catalog tables
while avoidin having the low level filesystem code know anything about them.

Instead of having the low level code fetch the pg_* records themselves, some
piece of higher level code would do the query and call down to storage layer
to inform it of the locations for everything. It would have to do this on
database initialization and on any subsequent object creation.

Basically maintain an in-memory hash table of oid -> path, and call down to
the low level code whenever that hash changes. (Or more likely oid->ts_id and
a separate list of ts_id -> path.)

The advantage of symlinks is that an administrator could see how things
are laid out from the command line.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#76Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#60)
Re: Out of space situation and WAL log pre-allocation (was Tablespaces)

Tom Lane [mailto:tgl@sss.pgh.pa.us]
Joe Conway <mail@joeconway.com> writes:

Tom Lane wrote:

Joe Conway <mail@joeconway.com> writes:

Maybe specify an archive location (that of course could be on a

separate

partition) that the external archiver should check in addition to

the

normal WAL location. At some predetermined interval, push WAL log
segments no longer needed to the archive location.

Does that really help? The panic happens when you fill the

"normal"

and

"archive" partitions, how's that different from one partition?

I see your point. But it would allow you to use a relatively modest
local partition for WAL segments, while you might be using a 1TB

netapp

tray over NFS for the archive segments.

Fair enough, but it seems to me that that sort of setup really falls

in

the category of a user-defined archiving process --- that is, the hook
that Postgres calls will push WAL segments from the local partition to
the NFS server, and then pushing them off NFS to tape is the
responsibility of some other user-defined subprocess. Database panic
happens if and only if the local partition overflows. I don't see

that

making Postgres explicitly aware of the secondary NFS arrangement will
buy anything.

Tom's last sentence there summarises the design I was working with. I
had considered Joe's suggested approach (which was Oracle's also).

However, the PITR design will come with a usable low-function program
which can easily copy logs from pg_xlog to another archive directory.
That's needed as a test harness anyway, so it may as well be part of the
package. You'd be able to use that in production to copy xlogs to
another larger directory as a staging area to tape/failover on another
system: effectively Joe's idea is catered for in the basic package.

Anyway I'm answering questions before publishing the design as
stands...though people do keep spurring me to refine it as I'm writing
it down! That's why its good to document it I guess.

I guess if the archive partition fills up, I would err on the side

of

dropping archive segments on the floor.

That should be user-scriptable policy, in my worldview.

Hmmm. Very difficult that one.

My experience is in commercial systems. Dropping archive segments on the
floor is just absolutely NOT GOOD, if that is the only behaviour. The
whole purpose of having a dbms is so that you can protect your business
data, while using it. Such behaviour would most likely be a barrier to
wider commercial adoption. [Oracle and other dbms will freeze when this
situation is hit, rather than continue and drop archive logs.]

User-selectable behaviour? OK. That's how we deal with fsync; I can
relate to that. That hadn't been part of my thinking because of the
importance I'd attached to the log files themselves, but I can go with
that, if that's what was meant.

So, if we had a parameter called Wal_archive_policy that has 3 settings:
None = no archiving
Optimistic = archive, but if for some reason log space runs out then
make space by dropping the oldest archive logs
Strict = if log space runs out, stop further write transactions from
committing, by whatever means, even if this takes down dbms.

That way, we've got something akin to transaction isolation level with
various levels of protection.

Best Regards, Simon Riggs

#77Joe Conway
mail@joeconway.com
In reply to: Simon Riggs (#76)
Re: Out of space situation and WAL log pre-allocation (was

Simon Riggs wrote:

Tom Lane [mailto:tgl@sss.pgh.pa.us] That should be user-scriptable
policy, in my worldview.

O... and other dbms will freeze when this situation is hit, rather
than continue and drop archive logs.]

Been there, done that, don't see how it's any better. I hesitate to be
real specific here, but let's just say the end result was restore from
backup :-(

So, if we had a parameter called Wal_archive_policy that has 3
settings: None = no archiving Optimistic = archive, but if for some
reason log space runs out then make space by dropping the oldest
archive logs Strict = if log space runs out, stop further write
transactions from committing, by whatever means, even if this takes
down dbms.

That sounds good to me. For the "Optimistic" case, we need to yell
loudly if we do find ourselves needing to drop segments. For the
"Strict" case, we just need to be sure it works correctly ;-)

Joe

#78Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joe Conway (#77)
Re: Out of space situation and WAL log pre-allocation (was Tablespaces)

Joe Conway <mail@joeconway.com> writes:

Simon Riggs wrote:

O... and other dbms will freeze when this situation is hit, rather
than continue and drop archive logs.]

Been there, done that, don't see how it's any better. I hesitate to be
real specific here, but let's just say the end result was restore from
backup :-(

It's hard for me to imagine a situation in which killing the database
would be considered a more attractive option than dropping old log
data. You may or may not ever need the old log data, but you darn well
do need a functioning database. (If you don't, you wouldn't be going to
all this work.)

I think also that Simon completely misunderstood my intent in saying
that this could be "user-scriptable policy". By that I meant that the
*user* could write the code to behave whichever way he liked. Not that
we were going to go into a mad rush of feature invention and try to
support every combination we could think of. I repeat: code that pushes
logs into a secondary area is not ours to write. We should concentrate
on providing an API that lets users write it. We have only limited
manpower for this project and we need to spend it on getting the core
functionality done right, not on inventing frammishes.

regards, tom lane

#79Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Simon Riggs (#76)
Re: Out of space situation and WAL log pre-allocation (was

Simon Riggs wrote:

User-selectable behaviour? OK. That's how we deal with fsync; I can
relate to that. That hadn't been part of my thinking because of the
importance I'd attached to the log files themselves, but I can go with
that, if that's what was meant.

So, if we had a parameter called Wal_archive_policy that has 3 settings:
None = no archiving
Optimistic = archive, but if for some reason log space runs out then
make space by dropping the oldest archive logs
Strict = if log space runs out, stop further write transactions from
committing, by whatever means, even if this takes down dbms.

That way, we've got something akin to transaction isolation level with
various levels of protection.

Yep, we will definately need something like that. Basically whenever
the logs are being archived, you have to stop the database if you can't
archive, no?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#80Thomas Swan
tswan@idigx.com
In reply to: Bruce Momjian (#75)
Re: Tablespaces

Bruce Momjian wrote:

Greg Stark wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Greg Stark wrote:

I am expecting to hear some bleating about this from people whose
preferred platforms don't support symlinks ;-). However, if we don't

Well, one option would be to have the low level filesystem storage (md.c?)
routines implement a kind of symlink themselves. Just a file with a special
magic number followed by a path.

On further contemplation it doesn't seem like using symlinks really ought to
be necessary. It should be possible to drive everything off the catalog tables
while avoidin having the low level filesystem code know anything about them.

Instead of having the low level code fetch the pg_* records themselves, some
piece of higher level code would do the query and call down to storage layer
to inform it of the locations for everything. It would have to do this on
database initialization and on any subsequent object creation.

Basically maintain an in-memory hash table of oid -> path, and call down to
the low level code whenever that hash changes. (Or more likely oid->ts_id and
a separate list of ts_id -> path.)

The advantage of symlinks is that an administrator could see how things
are laid out from the command line.

That's a poor reason to require symlinks. The administrator can just as
easily open up psql and query pg_tablespace to see that same
information. Besides, the postgres doesn't know where to look on the
filesystem for the /path/to/oid without a system catalog lookup. There
doesn't seem to be any sensible reason to force a filesystem requirement
when the core operations are diffferent to begin with.

If a more global view of all databases is necessary, perhaps there ought
to be a system wide view which could display all of that information at
once: dbname, relation name, and physical location.

#81Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Thomas Swan (#80)
Re: Tablespaces

Thomas Swan wrote:

The advantage of symlinks is that an administrator could see how things
are laid out from the command line.

That's a poor reason to require symlinks. The administrator can just as
easily open up psql and query pg_tablespace to see that same
information. Besides, the postgres doesn't know where to look on the
filesystem for the /path/to/oid without a system catalog lookup. There
doesn't seem to be any sensible reason to force a filesystem requirement
when the core operations are diffferent to begin with.

If a more global view of all databases is necessary, perhaps there ought
to be a system wide view which could display all of that information at
once: dbname, relation name, and physical location.

Who doesn't have symlinks these days, and is going to be using
tablespaces? Even Win32 has them.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#82Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Swan (#80)
Re: Tablespaces

Thomas Swan <tswan@idigx.com> writes:

Bruce Momjian wrote:

The advantage of symlinks is that an administrator could see how things
are laid out from the command line.

That's a poor reason to require symlinks. The administrator can just as
easily open up psql and query pg_tablespace to see that same
information.

Something to keep in mind here is that one of the times you would most
likely need that information is when the database is broken and you
*can't* simply "open up psql" and inspect system catalogs. I like the
fact that a symlink implementation can be inspected without depending on
a working database.

If we were going to build a non-symlink implementation, I'd want the
highlevel-to-lowlevel data transfer to take the form of a flat ASCII
file that could be inspected by hand, rather than some hidden in-memory
datastructure. But given the previous discussion in this thread,
I cannot see any strong reason not to rely on symlinks for the purpose.
We are not in the business of building replacements for OS features.

regards, tom lane

#83Thomas Swan
tswan@idigx.com
In reply to: Tom Lane (#82)
Re: Tablespaces

Tom Lane wrote:

Thomas Swan <tswan@idigx.com> writes:

Bruce Momjian wrote:

The advantage of symlinks is that an administrator could see how things
are laid out from the command line.

That's a poor reason to require symlinks. The administrator can just as
easily open up psql and query pg_tablespace to see that same
information.

Something to keep in mind here is that one of the times you would most
likely need that information is when the database is broken and you
*can't* simply "open up psql" and inspect system catalogs. I like the
fact that a symlink implementation can be inspected without depending on
a working database.

That's a sufficient argument, to allow for it. Recoverability would be
one reason.

If we were going to build a non-symlink implementation, I'd want the
highlevel-to-lowlevel data transfer to take the form of a flat ASCII
file that could be inspected by hand, rather than some hidden in-memory
datastructure. But given the previous discussion in this thread,
I cannot see any strong reason not to rely on symlinks for the purpose.
We are not in the business of building replacements for OS features.

I do like the flat file output at least for a record of what went
where. Regardless of whether or not symlinks are used, the admin would
need to know what directories/files/filesystems are to be backed up.

I am concerned as to what extent different filesystems do when you back
the directories up. Would NTFS containing symlinks be able to be
backed up with a tar/zip command, or is something more elaborate needed?
In the past, before upgrading, I have had to tar the pgdata directory
with the postmaster shutdown to insure a quick restoration of the
database in case an upgrade didn't proceed uneventfully. Also, in the
event of a major version upgrade the restored information may or may not
proceed uneventfully. I just wanted to point out something I thought
might be an issue further down the road.

Perhaps the system catalog / flat file approach would be a more solid
approach, both of which would not involve replacing or duplicating OS
features.

#84Shridhar Daithankar
shridhar@frodo.hserus.net
In reply to: Tom Lane (#78)
Re: Out of space situation and WAL log pre-allocation (was

Tom Lane wrote:

I think also that Simon completely misunderstood my intent in saying
that this could be "user-scriptable policy". By that I meant that the
*user* could write the code to behave whichever way he liked. Not that
we were going to go into a mad rush of feature invention and try to
support every combination we could think of. I repeat: code that pushes
logs into a secondary area is not ours to write. We should concentrate
on providing an API that lets users write it. We have only limited
manpower for this project and we need to spend it on getting the core
functionality done right, not on inventing frammishes.

Hmm... I totally agree. I think the backend could just offer a shared memory
segment and a marker message to another process to allow copy from it. then it
is the applications business to do the things.

Of course there has to be a two way agreement about it but an API is a real nice
thing rather than an application.

Shridhar

#85Noname
tswan@idigx.com
In reply to: Tom Lane (#69)
Re: [HACKERS] Tablespaces

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

My feeling is that we need not support tablespaces on OS's without
symlinks.

To create symlinked directories on Win2k NTFS see:
http://www.sysinternals.com/ntw2k/source/misc.shtml#junction
I think Win2000 or XP would be a reasonable restriction for Win32 PG
installations that want tablespaces.

Oh, good --- symlinks for directories are all that we need for this
design. I think that settles it then.

What archival tools are there that would restore this to this back to the
filesystem: tar? zip? What would happen if a symlink were removed or
pointed to an invalid location while the postmaste was running?

I think the catalog approach would future proof yourself; think about the
possibility of new filesystems, table storage mechanisms, or devices.
There may be a raw disk system in the future. With that you could point
to a block device instead of a filesystem directory. Symlinks seem like a
simple solution but will it be complete enough? A meta file or catalog
entry could store more than just the location of the tablespace: max/min
space allocation, vacuum frequencies, storage type (maybe in-place
updates), locking schemes (in case someome had a wild hair to customize
NFS operations).

That meta/configuration file with a summary or one that is generated for a
summary could be useful.

I just can't help but state that "I've got a bad feeling about this."

#86Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Noname (#85)
Re: [HACKERS] Tablespaces

tswan@idigx.com wrote:

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

My feeling is that we need not support tablespaces on OS's without
symlinks.

To create symlinked directories on Win2k NTFS see:
http://www.sysinternals.com/ntw2k/source/misc.shtml#junction
I think Win2000 or XP would be a reasonable restriction for Win32 PG
installations that want tablespaces.

Oh, good --- symlinks for directories are all that we need for this
design. I think that settles it then.

What archival tools are there that would restore this to this back to the
filesystem: tar? zip? What would happen if a symlink were removed or
pointed to an invalid location while the postmaste was running?

Well, for backup, just run tar or find on /data with a flag to follow
symlinks, and you are done. Can't get much easier than that.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#87Noname
tswan@idigx.com
In reply to: Bruce Momjian (#86)
Re: [HACKERS] Tablespaces

tswan@idigx.com wrote:

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

My feeling is that we need not support tablespaces on OS's without
symlinks.

To create symlinked directories on Win2k NTFS see:
http://www.sysinternals.com/ntw2k/source/misc.shtml#junction
I think Win2000 or XP would be a reasonable restriction for Win32 PG
installations that want tablespaces.

Oh, good --- symlinks for directories are all that we need for this
design. I think that settles it then.

What archival tools are there that would restore this to this back to
the
filesystem: tar? zip? What would happen if a symlink were removed or
pointed to an invalid location while the postmaste was running?

Well, for backup, just run tar or find on /data with a flag to follow
symlinks, and you are done. Can't get much easier than that.

I'm ruferring to NTFS and the win32 platforms. How does tar handle these
symlinks on the NTFS filesystem? What about if someone finds that FAT32
is significantly better for the database?

It seems a little insane to introduce an OS/filesystem dependency at the
onset of a porting effort especially if you hope to be OS agnostic for
feature sets. I think someone would be crying foul if a new feature only
worked on Linux and not on FreeBSD.

Additionally, another developer noted the advantage of a text file is that
it would be easy for someone to develop tools to help if it became
difficult to edit or parse. Additionally, there could be a change away
from a flat file format to an XML format to configure the tablespace area.

Another argument against the symlink approach is how they may change in
the future. A file location is universal, symlink behavoir may not be.
The symlink behavior on different ports may change. To rely on symlinks
introduces an unnecessary OS dependency. All that is needed is the file
location. This can be derived from a flat file, an XML file, a sysetm
catalog.

Also, to support some features on some platforms is a real poor prospect.
( This application requires the Linux port of PostgreSQL 7.5.1 ) seems
like a poor choice for advocating PostgreSQL. The extra effort insures
that all ports, current and future, can get the same set of features and
nhancements. I like Unix and Linux as much as the next guy, but I have
to be real and make the presumption that there are and will be other
operating systems and it would be wise to plan a little for that.

In reply to: Noname (#87)
Re: [HACKERS] Tablespaces

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

My feeling is that we need not support tablespaces on OS's without
symlinks.

To create symlinked directories on Win2k NTFS see:
http://www.sysinternals.com/ntw2k/source/misc.shtml#junction
I think Win2000 or XP would be a reasonable restriction for Win32 PG
installations that want tablespaces.

Oh, good --- symlinks for directories are all that we need for this
design. I think that settles it then.

Er, sorry to drop into the middle of this but do you want to cripple a port
before it is even complete? Is there a compelling reason to use symlinks rather
than a flat file? If the issue is just:

Gavin Sherry <swm@linuxworld.com.au> writes:
how the low-level file access code finds a tablespace.

then what is wrong with using an XML file that is loaded and traversed at start
up? I agree it would be a cool to use the file system as a database, but why
place a possible limiting factor for the sake of elegance? Isn't XML a valid
and accepted way to store hierarchial data?

Gavin Sherry <swm@linuxworld.com.au> writes:
I am expecting to hear some bleating about this from people whose
preferred platforms don't support symlinks ;-). However, if we don't

Well bleat I guess. Although I wouldn't exactly say preferred. I prefer to
think of myself as a realest getting paid to program on a platform. A platform
with symlinks carrying quite a bit of baggage. On NTFS they are called Junction
Points and are a special type of Reparse Point. One thing I noticed on the
Microsoft site regarding these:

(http://www.microsoft.com/whdc/DDK/IFSkit/reparse.mspx)

Reparse Points are a powerful feature of Windows 2000 (not available on Windows
NT� 4.0), but developers should be aware that there can only be one reparse
point per file, and some new Windows 2000 mechanisms use reparse points (HSM,
Native Structured Storage). Developers need to have fallback strategies for
when the reparse point tag is already in use for a file.

makes me question their usefulness at this point. I am currently exploring
another solution to the problem that caused me to investigate them.

Well, thanks for your time. I guess I can go baaack to lurking now. ;-)

Lawrence E. Smithmier, Jr.
MCP, MCAD
(919) 522-9738
Larry@Smithmier.com

#89Noname
jearl@bullysports.com
In reply to: Noname (#87)
Re: [HACKERS] Tablespaces

<tswan@idigx.com> writes:

tswan@idigx.com wrote:

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

My feeling is that we need not support tablespaces on OS's without
symlinks.

To create symlinked directories on Win2k NTFS see:
� http://www.sysinternals.com/ntw2k/source/misc.shtml#junction
I think Win2000 or XP would be a reasonable restriction for Win32 PG
installations that want tablespaces.

Oh, good --- symlinks for directories are all that we need for this
design.� I think that settles it then.

What archival tools are there that would restore this to this back to
the
filesystem: tar? zip?� What would happen if a symlink were removed or
pointed to an invalid location while the postmaste was running?

Well, for backup, just run tar or find on /data with a flag to
follow symlinks, and you are done.� Can't get much easier than
that.

I'm ruferring to NTFS and the win32 platforms.� How does tar handle
these symlinks on the NTFS filesystem?� What about if someone finds
that FAT32 is significantly better for the database?

tar doesn't know anything about PostgreSQL system catalogs. If we use
symlinks for tablespaces then it would be possible to backup downed
databases with a simple tar command on every platform *I* care about
(and probably Windows too). Using system catalogs for this stuff
would simply guarantee that I would have to read the system catalogs
and then back up each tablespace manually. In short, your idea would
trade off (maybe) having to backup tablespaces manually on a few
platforms for the certainty of having to backup tablespaces manually
on all platforms.

How is that a win?

It seems a little insane to introduce an OS/filesystem dependency at
the onset of a porting effort especially if you hope to be OS
agnostic for feature sets.� I think someone would be crying foul if
a new feature only worked on Linux and not on FreeBSD.

First of all, symlinks are a pretty popular "feature." Even Windows
supports what would be needed. Second of all, PostgreSQL will still
run on OSes without symlinks, tablespaces won't be available, but
PostgreSQL will still run. Since we are all using PostgreSQL without
tablespaces now, it can hardly be argued that tablespaces are a
critical feature.

We aren't talking about a "feature that work[s] on Linux on not on
FreeBSD." We are talking about a feature that works on every OS that
suports symlinks (which includes even operating systems like Windows
that PostgreSQL doesn't currently support).

Additionally, another developer noted the advantage of a text file
is that it would be easy for someone to develop tools to help if it
became difficult to edit or parse.� Additionally, there could be a
change away from a flat file format to an XML format to configure
the tablespace area.

The advantage of symlinks is that no tools would have to be written
and 'ls -l' would show everything you would need to know about where
your tablespaces actually were.

XML files are relatively easy to parse, but they certainly aren't as
easy as simply letting PostgreSQL follow a symlink. Why reinvent the
wheel with what would essentially be PostgreSQL's own implementation
of a symlink?

To go back to your *tar* example, are we going to rewrite tar so that
it reads PostgreSQL's XML file and *does the right thing*?

Another argument against the symlink approach is how they may change
in the future.�� A file location is universal, symlink behavoir may
not be. The symlink behavior on different ports may change.� To
rely on symlinks introduces an unnecessary OS dependency.� All that
is needed is the file location.� This can be derived from a flat
file, an XML file, a sysetm catalog.

Yes, and someone can whip out vi and edit a file or fire up psql and
change the system catalog. Heck, someone could issue a 'mv' or 'rm'
command on the actual files, for that matter. PostgreSQL can't
protect itself from careless sysadmins, and symlinks are no more
inherently dangerous than normal files. They don't just *disappear*.

Also, to support some features on some platforms is a real poor
prospect. �( This application requires the Linux port of PostgreSQL
7.5.1 ) seems like a poor choice for advocating PostgreSQL.�� The
extra effort insures that all ports, current and future, can get the
same set of features and nhancements.�� I like Unix and Linux as
much as the next guy, but I have to be real and make the presumption
that there are and will be other operating systems and it would be
wise to plan a little for that.

The fact of the matter is that PostgreSQL runs better on some
platforms than others, and it probably always will. Heck, as of
today, PostgreSQL is officially supported on the Gamecube. Does that
mean that the PostgreSQL developers should limit themselves to the
features offered by the Gamecube? What, precisely, is the point of
developing for the lowest common denominator?

Perhaps if you could give us an example of an actual case where some
actual PostgreSQL users (or potential users) might be affected?

#90Thomas Swan
tswan@idigx.com
In reply to: Noname (#89)
Re: [HACKERS] Tablespaces

jearl@bullysports.com wrote:

<tswan@idigx.com> writes:

tswan@idigx.com wrote:

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

My feeling is that we need not support tablespaces on OS's without
symlinks.

To create symlinked directories on Win2k NTFS see:
http://www.sysinternals.com/ntw2k/source/misc.shtml#junction
I think Win2000 or XP would be a reasonable restriction for Win32 PG
installations that want tablespaces.

Oh, good --- symlinks for directories are all that we need for this
design. I think that settles it then.

What archival tools are there that would restore this to this back to
the
filesystem: tar? zip? What would happen if a symlink were removed or
pointed to an invalid location while the postmaste was running?

Well, for backup, just run tar or find on /data with a flag to
follow symlinks, and you are done. Can't get much easier than
that.

I'm ruferring to NTFS and the win32 platforms. How does tar handle
these symlinks on the NTFS filesystem? What about if someone finds
that FAT32 is significantly better for the database?

tar doesn't know anything about PostgreSQL system catalogs. If we use
symlinks for tablespaces then it would be possible to backup downed
databases with a simple tar command on every platform *I* care about
(and probably Windows too). Using system catalogs for this stuff
would simply guarantee that I would have to read the system catalogs
and then back up each tablespace manually. In short, your idea would
trade off (maybe) having to backup tablespaces manually on a few
platforms for the certainty of having to backup tablespaces manually
on all platforms.

How is that a win?

Apparently, I have failed tremendously in addressing a concern. The
question is does PostgreSQL need to rely on symlinks and will that
dependency introduce problems?

There is an active win32 port underway (see this mailing list). One
proposal was to try to use an OS specific filesystem feature to perform
a symlink on NTFS. Can the special symlink that NTFS allegedly supports
be archived the same way symlinks are archived on Unix? If so, is there
a utility that can do this (zip, tar, etc). The backup operator would
still need to know what directories needed to be archived in addtion to
the pgdata directory. Is this symlink structure a normal/special file
that can be archived by normal means (tar,zip, etc)?

Example:

PGDATA is C:\pgdata
I have a tablespace in Z:\1\ and Z:\2\
There exists an alleged symlink in
C:\pgdata\data\base\tablespaces\schmoo -> Z:\1

Can I archive [ C:\pgdata, Z:\1, Z:\2 ], restore them, and have
postgresql working just as before?

It seems a little insane to introduce an OS/filesystem dependency at
the onset of a porting effort especially if you hope to be OS
agnostic for feature sets. I think someone would be crying foul if
a new feature only worked on Linux and not on FreeBSD.

First of all, symlinks are a pretty popular "feature." Even Windows
supports what would be needed. Second of all, PostgreSQL will still
run on OSes without symlinks, tablespaces won't be available, but
PostgreSQL will still run. Since we are all using PostgreSQL without
tablespaces now, it can hardly be argued that tablespaces are a
critical feature.

We aren't talking about a "feature that work[s] on Linux on not on
FreeBSD." We are talking about a feature that works on every OS that
suports symlinks (which includes even operating systems like Windows
that PostgreSQL doesn't currently support).

Hello? What was this response from Tom Lane? "My feeling is that we
need not support tablespaces on OS's without symlinks." That seems to
be indicative of a feature set restriction base on platform.

Additionally, another developer noted the advantage of a text file
is that it would be easy for someone to develop tools to help if it
became difficult to edit or parse. Additionally, there could be a
change away from a flat file format to an XML format to configure
the tablespace area.

The advantage of symlinks is that no tools would have to be written
and 'ls -l' would show everything you would need to know about where
your tablespaces actually were.

Where is 'ls -l' on a win32 box? If you will follow the discussion of
symlinks under MinGW you will see that they don't work as commanded.
And, postgresql is supposed to be compiled under MinGW, but not require
it to run.

From Windows 2000, 'ls' is not recognized as an internal or external
command,
operable program or batch file.

XML files are relatively easy to parse, but they certainly aren't as
easy as simply letting PostgreSQL follow a symlink. Why reinvent the
wheel with what would essentially be PostgreSQL's own implementation
of a symlink?

Is opening a file recreating a symlink? If you are opening file
descriptors why rely on symlinks. If you know the location either from
the system catalog, a or configuration file, is it any terribly more
complicated? Basically, if a tablespace needed to be renamed, or
moved, or changed, your going to have to do file management anyway.
The symlink saves you just a lookup as to what files go where? If you
kept this small hash in memory, it's not a continuous lookup because you
have the redirection internally. And, it's more portable.

To go back to your *tar* example, are we going to rewrite tar so that
it reads PostgreSQL's XML file and *does the right thing*?

I'm not talking about integrating tar with PostgreSQL or uniting the
universal string theory with how to cook apple pie. Why would think we
would have to rewrite tar?

Another argument against the symlink approach is how they may change
in the future. A file location is universal, symlink behavoir may
not be. The symlink behavior on different ports may change. To
rely on symlinks introduces an unnecessary OS dependency. All that
is needed is the file location. This can be derived from a flat
file, an XML file, a sysetm catalog.

Yes, and someone can whip out vi and edit a file or fire up psql and
change the system catalog. Heck, someone could issue a 'mv' or 'rm'
command on the actual files, for that matter. PostgreSQL can't
protect itself from careless sysadmins, and symlinks are no more
inherently dangerous than normal files. They don't just *disappear*.

Also, to support some features on some platforms is a real poor
prospect. ( This application requires the Linux port of PostgreSQL
7.5.1 ) seems like a poor choice for advocating PostgreSQL. The
extra effort insures that all ports, current and future, can get the
same set of features and nhancements. I like Unix and Linux as
much as the next guy, but I have to be real and make the presumption
that there are and will be other operating systems and it would be
wise to plan a little for that.

The fact of the matter is that PostgreSQL runs better on some
platforms than others, and it probably always will. Heck, as of
today, PostgreSQL is officially supported on the Gamecube. Does that
mean that the PostgreSQL developers should limit themselves to the
features offered by the Gamecube? What, precisely, is the point of
developing for the lowest common denominator?

The other option proposed was to give win32 a subset of features that
would be available to other platforms. In this case, that would be that
the win32 port could support tablespaces. This is strikingly different
than a performance issue. It would be one thing for tablespaces to
perform poorly, it's another for them to fail or not exist altogether.

Perhaps if you could give us an example of an actual case where some
actual PostgreSQL users (or potential users) might be affected?

See the comment from Tom Lane on limiting features. Look at the
potential Win32 market which outnumbers the unix market in number of
computers and developers by a large margin.

#91Richard Huxton
dev@archonet.com
In reply to: Thomas Swan (#90)
Re: [HACKERS] Tablespaces

On Friday 05 March 2004 07:51, Thomas Swan wrote:

Apparently, I have failed tremendously in addressing a concern. The
question is does PostgreSQL need to rely on symlinks and will that
dependency introduce problems?

There is an active win32 port underway (see this mailing list). One
proposal was to try to use an OS specific filesystem feature to perform
a symlink on NTFS. Can the special symlink that NTFS allegedly supports
be archived the same way symlinks are archived on Unix? If so, is there
a utility that can do this (zip, tar, etc).

Why not try and find out?
http://www.nedprod.com/programs/Win32/SymLink/
This will apparently create symlinks on NTFS5 volumes via a simple dialogue
box.

You'd probably want to test using pkzip/winzip, Symantec/Norton - anyone think
of any others?

--
Richard Huxton
Archonet Ltd

#92Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Richard Huxton (#91)
Re: [HACKERS] Tablespaces

First of all, symlinks are a pretty popular "feature." Even Windows
supports what would be needed. Second of all, PostgreSQL will still
run on OSes without symlinks, tablespaces won't be available, but
PostgreSQL will still run. Since we are all using PostgreSQL without

My idea for platforms that don't support symlinks would be to simply create
a tblspaceoid directory inplace instead of the symlink (maybe throw a warning).
My feeling is, that using the same syntax on such platforms is important,
but actual distribution is not (since they will most likely be small systems).

Andreas

#93Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Thomas Swan (#90)
Re: [HACKERS] Tablespaces

Thomas Swan wrote:

The fact of the matter is that PostgreSQL runs better on some
platforms than others, and it probably always will. Heck, as of
today, PostgreSQL is officially supported on the Gamecube. Does that
mean that the PostgreSQL developers should limit themselves to the
features offered by the Gamecube? What, precisely, is the point of
developing for the lowest common denominator?

The other option proposed was to give win32 a subset of features that
would be available to other platforms. In this case, that would be that
the win32 port could support tablespaces. This is strikingly different
than a performance issue. It would be one thing for tablespaces to
perform poorly, it's another for them to fail or not exist altogether.

Perhaps if you could give us an example of an actual case where some
actual PostgreSQL users (or potential users) might be affected?

See the comment from Tom Lane on limiting features. Look at the
potential Win32 market which outnumbers the unix market in number of
computers and developers by a large margin.

Let me put it this way. Unix tools like tar can already use symlinks,
as can administrators from the command line, so symlinks are best on
that platform, period.

If we have a platform that doesn't have symlinks (and I think Win32
might), we will have to implement a different way to do tablespace
lookups _only_ for those platforms.

The use of symlinks on Unix has just too many advantages to use a
another sub-optimal solution on that platform.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#94Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Zeugswetter Andreas SB SD (#92)
Re: [HACKERS] Tablespaces

Zeugswetter Andreas SB SD wrote:

First of all, symlinks are a pretty popular "feature." Even Windows
supports what would be needed. Second of all, PostgreSQL will still
run on OSes without symlinks, tablespaces won't be available, but
PostgreSQL will still run. Since we are all using PostgreSQL without

My idea for platforms that don't support symlinks would be to simply create
a tblspaceoid directory inplace instead of the symlink (maybe throw a warning).
My feeling is, that using the same syntax on such platforms is important,
but actual distribution is not (since they will most likely be small systems).

Nice idea --- so you can create tablespaces, but can't specify a
different location for it --- makes sense.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#95scott.marlowe
scott.marlowe@ihs.com
In reply to: Thomas Swan (#90)
Re: [HACKERS] Tablespaces

On Fri, 5 Mar 2004, Thomas Swan wrote:

jearl@bullysports.com wrote:

<tswan@idigx.com> writes:

tswan@idigx.com wrote:

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

My feeling is that we need not support tablespaces on OS's without
symlinks.

To create symlinked directories on Win2k NTFS see:
http://www.sysinternals.com/ntw2k/source/misc.shtml#junction
I think Win2000 or XP would be a reasonable restriction for Win32 PG
installations that want tablespaces.

Oh, good --- symlinks for directories are all that we need for this
design. I think that settles it then.

What archival tools are there that would restore this to this back to
the
filesystem: tar? zip? What would happen if a symlink were removed or
pointed to an invalid location while the postmaste was running?

Well, for backup, just run tar or find on /data with a flag to
follow symlinks, and you are done. Can't get much easier than
that.

I'm ruferring to NTFS and the win32 platforms. How does tar handle
these symlinks on the NTFS filesystem? What about if someone finds
that FAT32 is significantly better for the database?

tar doesn't know anything about PostgreSQL system catalogs. If we use
symlinks for tablespaces then it would be possible to backup downed
databases with a simple tar command on every platform *I* care about
(and probably Windows too). Using system catalogs for this stuff
would simply guarantee that I would have to read the system catalogs
and then back up each tablespace manually. In short, your idea would
trade off (maybe) having to backup tablespaces manually on a few
platforms for the certainty of having to backup tablespaces manually
on all platforms.

How is that a win?

Apparently, I have failed tremendously in addressing a concern. The
question is does PostgreSQL need to rely on symlinks and will that
dependency introduce problems?

There is an active win32 port underway (see this mailing list). One
proposal was to try to use an OS specific filesystem feature to perform
a symlink on NTFS. Can the special symlink that NTFS allegedly supports
be archived the same way symlinks are archived on Unix? If so, is there
a utility that can do this (zip, tar, etc). The backup operator would
still need to know what directories needed to be archived in addtion to
the pgdata directory. Is this symlink structure a normal/special file
that can be archived by normal means (tar,zip, etc)?

According to this page:

http://www.linuxinfor.com/en/man1/ln.1.html

from the linux man pages,
"On existing implementations, if it is at all possible to make a hard link
to a directory, this may be done by the superuser only. POSIX forbids the
system call link(2) and the utility ln to make hard links to directories
(but does not forbid hard links to cross filesystem boundaries)."

and states that the command ln is "POSIX 1003.2. However, POSIX 1003.2
(1996) does not discuss soft links. Soft links were introduced by BSD, and
do not occur in System V release 3 (and older) systems."

I fear the more useful of the two would be soft links, but if soft links
are not a part of the POSIX standard, then postgresql probably shouldn't
base key features on them unless said features would be onerous to
implement without soft links.

Oddly enough though, Microsoft has now released their unix services
package for free, and it comes with the commands to create a symbolic
link, and runs on Windows NT 4.0. Windows 2000. Windows XP. and Windows
Server 2003.

So, soft links would appear to not be a real non-starter, as long as the
ability to make softlinks on those systems won't rely on having some
strange package installed (like MS's Unix services package.)

I imagine there's a standard OS call in modern MS OSes that will let you
create a symbolic link with no special libs installed, and if that's the
case, the the argument that maybe FAT would be a better file system comes
under the same heading as running your database on NFS. Neither is a good
idea, and PGSQL can't guarantee normal, reliable operation.

Example:

PGDATA is C:\pgdata
I have a tablespace in Z:\1\ and Z:\2\
There exists an alleged symlink in
C:\pgdata\data\base\tablespaces\schmoo -> Z:\1

Can I archive [ C:\pgdata, Z:\1, Z:\2 ], restore them, and have
postgresql working just as before?

Have you tried building said structure and backing up and restoring it to
see?

We aren't talking about a "feature that work[s] on Linux on not on
FreeBSD." We are talking about a feature that works on every OS that
suports symlinks (which includes even operating systems like Windows
that PostgreSQL doesn't currently support).

Hello? What was this response from Tom Lane? "My feeling is that we
need not support tablespaces on OS's without symlinks." That seems to
be indicative of a feature set restriction base on platform.

If every OS we run on supports sym links is that such a restriction?

Which OSes (I'm asking because I really don't know) that are currently
supported don't support symbolic links?

Additionally, another developer noted the advantage of a text file
is that it would be easy for someone to develop tools to help if it
became difficult to edit or parse. Additionally, there could be a
change away from a flat file format to an XML format to configure
the tablespace area.

The advantage of symlinks is that no tools would have to be written
and 'ls -l' would show everything you would need to know about where
your tablespaces actually were.

Where is 'ls -l' on a win32 box? If you will follow the discussion of
symlinks under MinGW you will see that they don't work as commanded.
And, postgresql is supposed to be compiled under MinGW, but not require
it to run.

It's free from Microsoft. Postgresql wouldn't require it. It's a
standard file system level call on Windows, and has been since NT 4.0 was
a baby. That was 1996-1997 when I built an NT4.0 server, downloaded the
free gnu utils and found out that they worked.

Plain and simple, Windows has supported symbolic links, soft and hard, for
some time, and every flavor of unix known does as well. While MVS on a
mainframe or VMS might present some problems, those operating systems all
have LVMs and you could just mount and unmount space under your database.

In reply to: scott.marlowe (#95)
Re: [HACKERS] Tablespaces

Quoting Bruce Momjian <pgman@candle.pha.pa.us>:

Zeugswetter Andreas SB SD wrote:

My idea for platforms that don't support symlinks would be to simply

create

a tblspaceoid directory inplace instead of the symlink (maybe throw a

warning).

My feeling is, that using the same syntax on such platforms is important,
but actual distribution is not (since they will most likely be small

systems).

Nice idea --- so you can create tablespaces, but can't specify a
different location for it --- makes sense.

If this is the way it is handled then I don't guess I have any objections. You
would end up with the files for a single database spread out, possibly over
multiple disks. I think I would still prefer the option of using an XML file to
this, but I can also see the need to limit the platform specific code. I guess
if I feel strongly enough about it, through the magic of Open Source I can patch
it to work the way I want. If the way Junctions work on NTFS becomes an issue
after the port, it won't be a problem, it will be an opportunity.

Quoting Bruce Momjian <pgman@candle.pha.pa.us>:

The use of symlinks on Unix has just too many advantages to use a
another sub-optimal solution on that platform.

And creating separate solutions for different platforms has too many
disadvantages to warrent a compile time option. Ok, implement this and move on
to porting so that I can use it in the world I work in. :-)

Seriously, thanks for doing all you guys do. I will be activly evangelizing as
soon as I can get a version working standalone where I am.

Lawrence E. Smithmier, Jr.
MCP, MCAD
(919) 522-9738
Larry@Smithmier.com

#97Noname
jearl@bullysports.com
In reply to: Thomas Swan (#90)
Re: [HACKERS] Tablespaces

"Thomas Swan" <tswan@idigx.com> writes:

jearl@bullysports.com wrote:

[snip]

Apparently, I have failed tremendously in addressing a concern. The
question is does PostgreSQL need to rely on symlinks and will that
dependency introduce problems?

There is an active win32 port underway (see this mailing list).��
One proposal was to try to use an OS specific filesystem feature to
perform a symlink on NTFS.� Can the special symlink that NTFS
allegedly supports be archived the same way symlinks are archived on
Unix?� If so, is there a utility that can do this (zip, tar, etc).�
The backup operator would still need to know what directories needed
to be archived in addtion to the pgdata directory.��� Is this
symlink structure a normal/special file that can be archived by
normal means (tar,zip, etc)?

Example:

PGDATA is C:\pgdata
I have a tablespace in Z:\1\ and Z:\2\
There exists an alleged symlink in
C:\pgdata\data\base\tablespaces\schmoo -> Z:\1

Can I archive [ C:\pgdata, Z:\1, Z:\2 ], restore them, and have
postgresql working just as before?

My point is that if you are using some sort of sytem table, catalog
file, or whatever else (besides symlinks), you are still going to have
to backup up c:\pgdata z:\1 and z:\2 manually anyhow. Unless, of
course, you have some magical version of tar or zip that reads the
PostgreSQL specific configuration file and somehow "does the right
thing." At least with symlinks you have a fighting chance on some
platforms of being able to do:

tar zcvf /tmp/backup.tgz /usr/local/postgres/pgdata/

and have it do the right thing. Using PostgreSQL specific catalogs
you would force UNIX users (the majority of PostgreSQL users right
now) to back up their tablespaces manually. Forcing platforms with
symlinks to use your wacky symlink replacement just guarantees that
all platforms work equally poorly. It doesn't make the Win32 port any
better. You would still have to backup c:\pgdata z:\1\ and z:\2\
separately on Win32. The only difference is that now your misery
would have the company of all of the rest of us.

It seems a little insane to introduce an OS/filesystem dependency
at the onset of a porting effort especially if you hope to be OS
agnostic for feature sets.� I think someone would be crying foul if
a new feature only worked on Linux and not on FreeBSD. ���

First of all, symlinks are a pretty popular "feature."� Even Windows
supports what would be needed.� Second of all, PostgreSQL will still
run on OSes without symlinks, tablespaces won't be available, but
PostgreSQL will still run.� Since we are all using PostgreSQL
without tablespaces now, it can hardly be argued that tablespaces
are a critical feature.

We aren't talking about a "feature that work[s] on Linux on not on
FreeBSD."� We are talking about a feature that works on every OS
that suports symlinks (which includes even operating systems like
Windows that PostgreSQL doesn't currently support).

Hello?� What was this response from Tom Lane? "My feeling is that we
need not support tablespaces on OS's without symlinks."� That seems to
be indicative of a feature set restriction base on platform.

Tom Lane works for Red Hat. You can't hardly expect him to spend all
of his time working around the limitations of the competition's
operating system.

Additionally, another developer noted the advantage of a text file
is that it would be easy for someone to develop tools to help if it
became difficult to edit or parse.� Additionally, there could be a
change away from a flat file format to an XML format to configure
the tablespace area.

The advantage of symlinks is that no tools would have to be written
and 'ls -l' would show everything you would need to know about where
your tablespaces actually were.

Where is 'ls -l' on a win32 box?� If you will follow the discussion
of symlinks under MinGW you will see that they don't work as
commanded. And, postgresql is supposed to be compiled under MinGW,
but not require it to run.

From Windows 2000, 'ls' is not recognized as an internal or external
command, operable program or batch file.

Yes, Windows lacks 'ls', but it has similar tools.

XML files are relatively easy to parse, but they certainly aren't as
easy as simply letting PostgreSQL follow a symlink.� Why reinvent the
wheel with what would essentially be PostgreSQL's own implementation
of a symlink?
�

Is opening a file recreating a symlink?� If you are opening file
descriptors why rely on symlinks.� If you know the location either
from the system catalog, a or configuration file, is it any terribly
more complicated?�� Basically, if a tablespace needed to be renamed,
or moved, or changed, your going to have to do file management
anyway.� The symlink saves you just a lookup as to what files go
where?� If you kept this small hash in memory, it's not a continuous
lookup because you have the redirection internally.� And, it's more
portable.

Yes, and the implementation of tablespaces apparently keeps track of
all of that in the system catalogs. In other words, the new table
space mechanism will be fancier than what we currently have (which is
just the ability to move the files themselves somewhere else and
create symlinks).

What the symlink does is allow the core part of PostgreSQL to pretend
that the files are still in one directory just like they have always
been. It also allows folks using OSes that support symlinks to be
able to simply tar up the directory :).

To go back to your *tar* example, are we going to rewrite tar so that
it reads PostgreSQL's XML file and *does the right thing*?

I'm not talking about integrating tar with PostgreSQL or uniting the
universal string theory with how to cook apple pie.� Why would think
we would have to rewrite tar?

Your tar example was the only argument that you made against symlinks
that made any sense at all. Here's an example that you wanted to see
work.

Example:

PGDATA is C:\pgdata
I have a tablespace in Z:\1\ and Z:\2\
There exists an alleged symlink in
C:\pgdata\data\base\tablespaces\schmoo -> Z:\1

Can I archive [ C:\pgdata, Z:\1, Z:\2 ], restore them, and have
postgresql working just as before?

Now, using some sort of XML catalog can you archive [ C:\pgdata, Z:\1,
Z:\2 ] and have PostgreSQL working just as before?

The answer to that question is *no*, unless of course you read the XML
file yourself, archive each tablespace manually, and then restore it
all piece by piece. After all, zip or tar doesn't know anything about
your XML file, and so it doesn't have a clue about your various
tablespaces outside of c:\pgdata. Since this is *precisely* what you
would have to do with the symlink implementation (the difference being
that you would have to read the system catalogs instead of an XML
file) what's the point in handcapping systems with symlinks that work?

[snip]

The other option proposed was to give win32 a subset of features that
would be available to other platforms.� In this case, that would be that
the win32 port could support tablespaces.� This is strikingly different
than a performance issue.�� It would be one thing for tablespaces to
perform poorly, it's another for them to fail or not exist altogether.

Perhaps if you could give us an example of an actual case where some
actual PostgreSQL users (or potential users) might be affected?

See the comment from Tom Lane on limiting features.� Look at the
potential Win32 market which outnumbers the unix market in number of
computers and developers by a large margin.

I *agree* with Tom Lane. Personally I am more concerned about
PostgreSQL running well on Linux (the platform I use) then I am about
being able to possibly win potential Windows installations. As far as
I am concerned my use of PostgreSQL on Linux gives me a competitive
advantage :). I see no reason to dumb down the Linux version of
PostgreSQL simply so that I can share the misery that a Windows user
has to face on a daily basis.

Besides which, directory symlinks actually do exist on Windows. I
have spent a bit of time playing with sysinternals 'junction' program,
and while not quite as cool as symlinks, it would certainly work.

Jason

#98Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#97)
Re: [HACKERS] Tablespaces

[ lots of opinions about depending on symlinks for tablespaces ]

One thing that I think hasn't been noted in this thread is that our
initial implementation won't bind us forever. If it becomes clear that
a symlink-based implementation has real problems, we can change it.
But if we spend extra effort to avoid symlinks from the start, we will
never know whether that effort was justified or wasted.

We have enough work to do to make tablespaces happen that I feel we
should take the path of least resistance to start with. I do not think
it would cost us much wasted effort if the decision proves out wrong.

regards, tom lane

#99Marko Karppinen
marko@karppinen.fi
In reply to: Bruce Momjian (#75)
Re: Tablespaces

On 3 March 2004, at 19:52, Bruce Momjian wrote:

The advantage of symlinks is that an administrator could see how things
are laid out from the command line.

One thing to keep in mind is that system administrators don't see
symlinks as being informational -- they see them as the actual UI
for the redirection in question. So their expectation is that they'll
be able to move the actual directory around at will (as long as they
update the symlink to match).

If symlinks are used, the rule of least surprise would mean that
no information whatsoever about the physical location of a tablespace
should be stored in the system catalogs. Otherwise their relationship
with the information stored in the symlink is ambiguous.

mk

#100Andrew Sullivan
ajs@crankycanuck.ca
In reply to: Marko Karppinen (#99)
Re: Tablespaces

On Mon, Mar 08, 2004 at 02:07:35AM +0200, Marko Karppinen wrote:

One thing to keep in mind is that system administrators don't see
symlinks as being informational -- they see them as the actual UI
for the redirection in question. So their expectation is that they'll
be able to move the actual directory around at will (as long as they
update the symlink to match).

This is a good point. It's worth keeping in mind, too, that in large
shops, the DBAs and the sysadmins often are in separate departments
with separate management, precisely because the database system has
traditionally been somewhat divorced from the OS (as an aside, I
suspect that this sort of separation is part of the reason for the
popularity of raw filesystems among DBAs. Even if they didn't
provide better speed, it's just preferable not to have to involve
another department). System administrators in such places have been
known to decide to "reorganise the disks", assuming that the database
just has its own home. For such a sysadmin, a pile of symlinks would
be fair game for reorganisation.

A

--
Andrew Sullivan | ajs@crankycanuck.ca

#101Andreas Pflug
pgadmin@pse-consulting.de
In reply to: Andrew Sullivan (#100)
Re: Tablespaces

Andrew Sullivan wrote:

eorganise the disks", assuming that the database
just has its own home. For such a sysadmin, a pile of symlinks would
be fair game for reorganisation.

Please take into consideration that symlinks might be every day work for
*nix admins, but for win admins it's very uncommon. Additionally, win
admins are accustomed to gui tools, and many of them will stumble if
forced to use a command line. For worse, junctions are not
distinguishable in explorer; only the shell's dir command knows about
junctions. This obfuscation makes junctions quite a bad choice for admin
purposes.

IMHO there are only two viable options:
- no tablespaces for win32, i.e. recommend *ix for bigger installations
- a different tablespace storage approach., e.g. simple desktop links
(or alike) redirecting to a directory.

Regards,
Andreas

#102Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Andrew Sullivan (#100)
Re: Tablespaces

Andrew Sullivan wrote:

On Mon, Mar 08, 2004 at 02:07:35AM +0200, Marko Karppinen wrote:

One thing to keep in mind is that system administrators don't see
symlinks as being informational -- they see them as the actual UI
for the redirection in question. So their expectation is that they'll
be able to move the actual directory around at will (as long as they
update the symlink to match).

This is a good point. It's worth keeping in mind, too, that in large
shops, the DBAs and the sysadmins often are in separate departments
with separate management, precisely because the database system has
traditionally been somewhat divorced from the OS (as an aside, I
suspect that this sort of separation is part of the reason for the
popularity of raw filesystems among DBAs. Even if they didn't
provide better speed, it's just preferable not to have to involve
another department). System administrators in such places have been
known to decide to "reorganise the disks", assuming that the database
just has its own home. For such a sysadmin, a pile of symlinks would
be fair game for reorganisation.

Agreed. I think the idea is to use lstat to query the symlink, rather
than storing that information in the database. My idea was to create an
lstat server-side function that could be used by pg_dump and friends.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#103Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#78)
Re: Out of space situation and WAL log pre-allocation (was Tablespaces)

Please excuse the delay in replying..

Tom Lane
Joe Conway <mail@joeconway.com> writes:

Simon Riggs wrote:

O... and other dbms will freeze when this situation is hit, rather
than continue and drop archive logs.]

Been there, done that, don't see how it's any better. I hesitate to

be

real specific here, but let's just say the end result was restore

from

backup :-(

Myself also. I accept your experience and insight, I apologise if my own
seemed overblown. My take on that is that if you're in a situation that
has a high probability of going bad, the last thing you would want is to
drop xlogs. Same technical experience, different viewpoint on what to
learn from it.

It's hard for me to imagine a situation in which killing the database
would be considered a more attractive option than dropping old log
data. You may or may not ever need the old log data, but you darn

well

do need a functioning database. (If you don't, you wouldn't be going

to

all this work.)

The main point here for me is that the choice of keeping archived (not
old) log files against keeping the database up isn't actually mine to
make; that choice belongs to the owner of the database, not me as
developer or administrator, consultant or whatever.

Although I admit I did not at first comprehend that such a view was
possible, I did flex to allow yours and Joe's perspective when that was
voiced.

The point is one of risk: does the owner wish to risk the possibility
that a transaction may be lost in order to keep the database up? The
possibility of lost rows must be balanced against the probably higher
possibility of being unable to write new data. But which is worse? Who
can say?

In some environments where I have worked, (again forgive any seeming
personal arrogance or posturing), such as banks or finance generally, it
has been desirable to stop the system rather than risk losing even a
single row. In other situations, lost rows must be balanced against the
money lost through downtime. Guess it depends whether you've got a
contract for uptime or for data integrity?? ;)

I repeat: code that pushes
logs into a secondary area is not ours to write. We should

concentrate

on providing an API that lets users write it.

Agreed.

We have only limited
manpower for this project and we need to spend it on getting the core
functionality done right, not on inventing frammishes.

Love that word "frammish"...seriously, I understand and agree.

My understanding is that existing logic will cause a PANIC if the xlog
directory cannot be written to. Helping the database stay up by dropping
logs would require extra code...

This was an edge case anyhow...

Best Regards, Simon Riggs

#104Simon Riggs
simon@2ndquadrant.com
In reply to: Joe Conway (#77)
Re: Out of space situation and WAL log pre-allocation (was Tablespaces)

Joe Conway [mailto:mail@joeconway.com]
Simon Riggs wrote:

Tom Lane [mailto:tgl@sss.pgh.pa.us] That should be user-scriptable
policy, in my worldview.

O... and other dbms will freeze when this situation is hit, rather
than continue and drop archive logs.]

Been there, done that, don't see how it's any better. I hesitate to be
real specific here, but let's just say the end result was restore from
backup :-(

So, if we had a parameter called Wal_archive_policy that has 3
settings: None = no archiving Optimistic = archive, but if for some
reason log space runs out then make space by dropping the oldest
archive logs Strict = if log space runs out, stop further write
transactions from committing, by whatever means, even if this takes
down dbms.

That sounds good to me. For the "Optimistic" case, we need to yell
loudly if we do find ourselves needing to drop segments. For the
"Strict" case, we just need to be sure it works correctly ;-)

Good.

Yell loudly really needs to happen sometime earlier, which is as Gavin
originally thought something to do with tablespaces.

Strict behaviour is fairly straightforward, you just PANIC!

I'd think we could rename these to
Fail Operational rather than Optimistic
Fail Safe rather than Strict
...the other names were a bit like "I'm right" and "but I'll do yours
too" ;}

Best Regards, Simon Riggs

#105Simon Riggs
simon@2ndquadrant.com
In reply to: Bruce Momjian (#79)
Re: Out of space situation and WAL log pre-allocation (was

Bruce Momjian
Simon Riggs wrote:

User-selectable behaviour? OK. That's how we deal with fsync; I can
relate to that. That hadn't been part of my thinking because of the
importance I'd attached to the log files themselves, but I can go

with

that, if that's what was meant.

So, if we had a parameter called Wal_archive_policy that has 3

settings:

None = no archiving
Optimistic = archive, but if for some reason log space runs out then
make space by dropping the oldest archive logs
Strict = if log space runs out, stop further write transactions from
committing, by whatever means, even if this takes down dbms.

That way, we've got something akin to transaction isolation level

with

various levels of protection.

Yep, we will definately need something like that. Basically whenever
the logs are being archived, you have to stop the database if you

can't

archive, no?

That certainly was my initial feeling, though I believe it is possible
to accommodate both viewpoints. I would not want to have only the
alternative viewpoint, I must confess.

Best Regards, Simon Riggs

#106Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Simon Riggs (#105)
Re: Out of space situation and WAL log pre-allocation (was

Simon Riggs wrote:

Bruce Momjian
Simon Riggs wrote:

User-selectable behaviour? OK. That's how we deal with fsync; I can
relate to that. That hadn't been part of my thinking because of the
importance I'd attached to the log files themselves, but I can go

with

that, if that's what was meant.

So, if we had a parameter called Wal_archive_policy that has 3

settings:

None = no archiving
Optimistic = archive, but if for some reason log space runs out then
make space by dropping the oldest archive logs
Strict = if log space runs out, stop further write transactions from
committing, by whatever means, even if this takes down dbms.

That way, we've got something akin to transaction isolation level

with

various levels of protection.

Yep, we will definately need something like that. Basically whenever
the logs are being archived, you have to stop the database if you

can't

archive, no?

That certainly was my initial feeling, though I believe it is possible
to accommodate both viewpoints. I would not want to have only the
alternative viewpoint, I must confess.

Added to PITR TODO list. Anything else to add:

http://momjian.postgresql.org/main/writings/pgsql/project

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#107Greg Stark
gsstark@mit.edu
In reply to: Simon Riggs (#104)
Re: Out of space situation and WAL log pre-allocation (was Tablespaces)

"Simon Riggs" <simon@2ndquadrant.com> writes:

Strict behaviour is fairly straightforward, you just PANIC!

There is another mode possible as well. Oracle for example neither panics nor
continues, it just freezes. It keeps retrying the transaction until it finds
it has space.

The sysadmin or dba just has to somehow create additional space by removing
old files or however and the database will continue where it left off. That
seems a bit nicer than panicing.

When I first heard that I was shocked. It means implementing archive logs
*created* a new failure mode where there was none before. I thought that was
the dumbest idea in the world: who needed a backup process that increased the
chances of an outage? Now I can see the logic, but I'm still not sure which
mode I would pick if it was up to me. As others have said, I guess it would
depend on the situation.

--
greg

#108Greg Stark
gsstark@mit.edu
In reply to: Dennis Bjorklund (#6)
Re: Tablespaces

Dennis Bjorklund <db@zigo.dhs.org> writes:

On Thu, 26 Feb 2004, Gavin Sherry wrote:

Comments? Questions? Suggestions?

Is that plan that in the future one can split a single table into
different table spaces? Like storing all rows with year < 1999 in one
tablespace and the rest in another?

That's a separate orthogonal feature called "partitioned tables". There's some
amount of resistance to the idea amongst postgres people, and there's nobody
who has spoken up interested in implementing it, but there's also lots of
interest from users. A good patch would probably go a long way to convincing
people :)

Table spaces are being able to store different tables in different physical
locations on disk. A first version of this has actually been implemented for
7.5 using symlinks.

Partitioned tables and tablespaces do indeed have a certain amount of synergy.
But even in a single tablespace your example makes sense.

With the rule system and two underlying tables one could make it work by
hand I think.

The rule system could be used to do this, but there was some discussion of
using inherited tables to handle it. However neither handles the really hard
part of detecting queries that use only a part of the table and taking that
into account in generating the plan.

--
greg

#109Noname
pgsql@mohawksoft.com
In reply to: Andrew Sullivan (#33)
Re: Tablespaces

I don't think we want features for their own sake, though, and I'm
not convinced that raw filesystems are actually useful. Course, it's
not my itch, and PostgreSQL _is_ free software.

I agree that raw file systems are seldom useful with one caveat, more
advanced file systems are sometimes detrimental to database access.

Conceptually, a file system and a database are redundant, both are doing
their best to preserve data integrity. This is especially true with
journalling file systems. Not to mention technologies like reiserfs which
attempts to do sub-block allocation.

What I think would go a long way to improving database performance on
non-raw partitions would be a simplified file system -- SFS anyone? The
simplified file system would not track access time. It would not overly
try to manage disk space. The target applications are going to allocate
disk space on a block level, rather than quibble about 4K here or 8K here,
have a user defined standard allocation unit of 64K, 128K, or so on.
Reduction on allocation overhead also reduces meta-data updating I/O. I
can almost imagine 32BIT FAT with large clusers, only with real inodes.
The idea would be that a database, like PostgreSQL, would be managing the
data not the file system. The file systems job would only to be the most
minimalist interface to the OS.

The benefts would be awesome, near-raw partition access and standard OS
tools for maintainence.