Big 7.1 open items

Started by Bruce Momjian over 25 years ago, 342 messages
#1 Bruce Momjian
pgman@candle.pha.pa.us

Here is the list I have gotten of open 7.1 items:

bit type
inheritance
drop column
vacuum index speed
cached query plans
memory context cleanup
TOAST
WAL
fmgr redesign
encrypt pg_shadow passwords
redesign pg_hba.conf password file option
new location for config files

I have some of my own that are not on the list, as do others who are
working on their own items. Just thought a list of major items that
need work would be helpful.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#2 Karel Zak
zakkr@zf.jcu.cz
In reply to: Bruce Momjian (#1)
Re: Big 7.1 open items

On Tue, 13 Jun 2000, Bruce Momjian wrote:

Here is the list I have gotten of open 7.1 items:

bit type
inheritance
drop column
vacuum index speed
cached query plans

^^^^^^^^^^^^^^^^^

I have already done it and will send a patch for _testing_ next week (or
later), but I think it will not be for 7.1, but 7.2.

memory context cleanup
TOAST
WAL
fmgr redesign
encrypt pg_shadow passwords
redesign pg_hba.conf password file option
new location for config files

+ new ACL? (please :-)

BTW. --- really cool list.

Karel

#3 Vince Vielhaber
vev@michvhf.com
In reply to: Bruce Momjian (#1)
Re: Big 7.1 open items

On Tue, 13 Jun 2000, Bruce Momjian wrote:

Here is the list I have gotten of open 7.1 items:

encrypt pg_shadow passwords

This will be for 7.1? For some reason I thought it was being pushed
off to 7.2.

Vince.
--
==========================================================================
Vince Vielhaber -- KA8CSH email: vev@michvhf.com http://www.pop4.net
128K ISDN from $22.00/mo - 56K Dialup from $16.00/mo at Pop4 Networking
Online Campground Directory http://www.camping-usa.com
Online Giftshop Superstore http://www.cloudninegifts.com
==========================================================================

#4 Sergio A. Kessler
sak@tribctas.gba.gov.ar
In reply to: Bruce Momjian (#1)
Re: Big 7.1 open items

On Tue, 13 Jun 2000 05:05:53 -0400 (EDT), Bruce Momjian <pgman@candle.pha.pa.us> wrote:

[...]

new location for config files

can I suggest /etc/postgresql ?

sergio

#5 Peter Eisentraut
e99re41@DoCS.UU.SE
In reply to: Bruce Momjian (#1)
Re: Big 7.1 open items

On Tue, 13 Jun 2000, Bruce Momjian wrote:

Here is the list I have gotten of open 7.1 items:

bit type
inheritance
drop column
vacuum index speed
cached query plans
memory context cleanup
TOAST
WAL
fmgr redesign
encrypt pg_shadow passwords
redesign pg_hba.conf password file option

Any details?

new location for config files

Are you referring to pushing internal files to `$PGDATA/global'?

--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden

#6 The Hermit Hacker
scrappy@hub.org
In reply to: Sergio A. Kessler (#4)
Re: Big 7.1 open items

On Tue, 13 Jun 2000, Sergio A. Kessler wrote:

On Tue, 13 Jun 2000 05:05:53 -0400 (EDT), Bruce Momjian <pgman@candle.pha.pa.us> wrote:

[...]

new location for config files

can I suggest /etc/postgresql ?

you can ... but everything related to postgresql has always been designed
not to require any special permissions to install, and /etc/postgresql
would definitely require root access to install :(

#7 Vince Vielhaber
vev@michvhf.com
In reply to: The Hermit Hacker (#6)
Re: Big 7.1 open items

On Tue, 13 Jun 2000, The Hermit Hacker wrote:

On Tue, 13 Jun 2000, Sergio A. Kessler wrote:

On Tue, 13 Jun 2000 05:05:53 -0400 (EDT), Bruce Momjian <pgman@candle.pha.pa.us> wrote:

[...]

new location for config files

can I suggest /etc/postgresql ?

you can ... but everything related to postgresql has always been designed
not to require any special permissions to install, and /etc/postgresql
would definitely require root access to install :(

~postgres/etc ??

Vince.

#8 Peter Eisentraut
e99re41@DoCS.UU.SE
In reply to: Vince Vielhaber (#7)
Re: Big 7.1 open items

On Tue, 13 Jun 2000, Vince Vielhaber wrote:

new location for config files

can I suggest /etc/postgresql ?

you can ... but everything related to postgresql has always been designed
not to require any special permissions to install, and /etc/postgresql
would definitely require root access to install :(

~postgres/etc ??

You need root access to create a postgres user. What's wrong with just
keeping it in $PGDATA and making symlinks wherever you would prefer it?
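[Editor's sketch of the symlink approach described above. The paths are stand-ins created under temp directories, not real system locations:]

```python
# Keep the real config with the data directory; expose it under a
# conventional admin-visible name via a symlink.
import os
import tempfile

pgdata = tempfile.mkdtemp()                       # stands in for $PGDATA
conf = os.path.join(pgdata, "postgresql.conf")
with open(conf, "w") as f:
    f.write("# real config lives in the data directory\n")

etcdir = tempfile.mkdtemp()                       # stands in for /etc/postgresql
link = os.path.join(etcdir, "postgresql.conf")
os.symlink(conf, link)                            # admin name -> $PGDATA file

# Reading through the symlink sees the file inside $PGDATA:
with open(link) as f:
    contents = f.read()
```

This keeps the per-data-tree property intact: each $PGDATA still owns its own config, and the symlink is purely a convenience.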


#9 The Hermit Hacker
scrappy@hub.org
In reply to: Vince Vielhaber (#7)
Re: Big 7.1 open items

that one works ...

On Tue, 13 Jun 2000, Vince Vielhaber wrote:

On Tue, 13 Jun 2000, The Hermit Hacker wrote:

On Tue, 13 Jun 2000, Sergio A. Kessler wrote:

On Tue, 13 Jun 2000 05:05:53 -0400 (EDT), Bruce Momjian <pgman@candle.pha.pa.us> wrote:

[...]

new location for config files

can I suggest /etc/postgresql ?

you can ... but everything related to postgresql has always been designed
not to require any special permissions to install, and /etc/postgresql
would definitely require root access to install :(

~postgres/etc ??

Vince.
--
==========================================================================
Vince Vielhaber -- KA8CSH email: vev@michvhf.com http://www.pop4.net
128K ISDN from $22.00/mo - 56K Dialup from $16.00/mo at Pop4 Networking
Online Campground Directory http://www.camping-usa.com
Online Giftshop Superstore http://www.cloudninegifts.com
==========================================================================

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Vince Vielhaber (#3)
Re: Big 7.1 open items

Vince Vielhaber <vev@michvhf.com> writes:

encrypt pg_shadow passwords

This will be for 7.1? For some reason I thought it was being pushed
off to 7.2.

I don't know of anything that would force delaying it --- it's not
dependent on querytree redesign, for example. The real question is,
do we have anyone who's committed to do the work? I heard a lot of
discussion but I didn't hear anyone taking responsibility for it...

regards, tom lane

#11 Vince Vielhaber
vev@michvhf.com
In reply to: Tom Lane (#10)
Re: Big 7.1 open items

On Tue, 13 Jun 2000, Tom Lane wrote:

Vince Vielhaber <vev@michvhf.com> writes:

encrypt pg_shadow passwords

This will be for 7.1? For some reason I thought it was being pushed
off to 7.2.

I don't know of anything that would force delaying it --- it's not
dependent on querytree redesign, for example. The real question is,
do we have anyone who's committed to do the work? I heard a lot of
discussion but I didn't hear anyone taking responsibility for it...

I offered to do the work, and I have the md5 routine here, tested on
a number of platforms. But as I said, I thought someone wanted to delay
it until 7.2; if that's not the case, then I'll get to it. There was also
a lack of interest in testing it, but I think we have most platforms
covered.
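[Editor's sketch of what hashed pg_shadow passwords could look like. Salting the digest with the username is an assumption for illustration here, and the function names are invented:]

```python
# Store md5(password || username) instead of the cleartext password.
# Salting with the username keeps identical passwords from producing
# identical stored values for different users.
import hashlib

def encrypt_password(password: str, username: str) -> str:
    digest = hashlib.md5((password + username).encode()).hexdigest()
    return "md5" + digest            # prefix marks the stored form as hashed

def check_password(candidate: str, username: str, stored: str) -> bool:
    # Re-hash the candidate and compare; the cleartext is never stored.
    return encrypt_password(candidate, username) == stored

stored = encrypt_password("secret", "vince")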

Vince.

#12 Vince Vielhaber
vev@michvhf.com
In reply to: Vince Vielhaber (#11)
Re: Big 7.1 open items

On Tue, 13 Jun 2000, Ed Loehr wrote:

Vince Vielhaber wrote:

[...]

new location for config files

can I suggest /etc/postgresql ?

you can ... but everything related to postgresql has always been designed
not to require any special permissions to install, and /etc/postgresql
would definitely require root access to install :(

~postgres/etc ??

I would suggest you don't *require* or assume the creation of a postgres
user, except as an overridable default.

I *knew* somebody would bring this up. Before I sent that I tried to
describe the intent a few ways and just opted for simple. PostgreSQL
has to run as SOMEONE. Substitute that SOMEONE for ~postgres above.

Vince.

#13 Ed Loehr
eloehr@austin.rr.com
In reply to: Vince Vielhaber (#7)
Re: Big 7.1 open items

Vince Vielhaber wrote:

[...]

new location for config files

can I suggest /etc/postgresql ?

you can ... but everything related to postgresql has always been designed
not to require any special permissions to install, and /etc/postgresql
would definitely require root access to install :(

~postgres/etc ??

I would suggest you don't *require* or assume the creation of a postgres
user, except as an overridable default.

Regards,
Ed Loehr

#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: The Hermit Hacker (#6)
Re: Big 7.1 open items

The Hermit Hacker <scrappy@hub.org> writes:

new location for config files

can I suggest /etc/postgresql ?

you can ... but everything related to postgresql has always been designed
not to require any special permissions to install, and /etc/postgresql
would definitely require root access to install :(

Even more to the point, the config files are always kept in the data
directory so that it's possible to run multiple installations on the
same machine. Keeping the config files under /etc (or any other fixed
location) would destroy that capability.

regards, tom lane

#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Vince Vielhaber (#11)
Re: Big 7.1 open items

Vince Vielhaber <vev@michvhf.com> writes:

do we have anyone who's committed to do the work? I heard a lot of
discussion but I didn't hear anyone taking responsibility for it...

I offered to do the work, and I have the md5 routine here, tested on
a number of platforms. But as I said, I thought someone wanted to delay
it until 7.2; if that's not the case, then I'll get to it.

Far as I can see, you should go for it.

regards, tom lane

#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#1)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Here is the list I have gotten of open 7.1 items:

There were a whole bunch of issues about the type system --- automatic
coercion rules, default type selection for both numeric and string
literals, etc. Not sure how to describe this in five words or less...

regards, tom lane

#17 Kaare Rasmussen
kar@webline.dk
In reply to: Bruce Momjian (#1)
Re: Big 7.1 open items

Here is the list I have gotten of open 7.1 items:

I thought that someone was working on:
outer joins
better views (or rewriting the rules system; not sure what the direction was)
better SQL92 compliance
Also, I think that at some time there was discussion about a better interface
for procedures, enabling them to work on several tuples. I may be wrong, though.

But if all, or just most, of the items on your list are finished, it ought
to be an 8.0 release :-)

--
Kaare Rasmussen --Linux, spil,-- Tlf: 3816 2582
Kaki Data tshirts, merchandize Fax: 3816 2582
Howitzvej 75 Åben 14.00-18.00 Email: kar@webline.dk
2000 Frederiksberg Lørdag 11.00-17.00 Web: www.suse.dk

#18 Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Bruce Momjian (#1)
Re: Big 7.1 open items

Since there are several people interested in contributing, we should
list:

Support multiple simultaneous character sets, per SQL92

- Thomas

#19 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Karel Zak (#2)
Re: Big 7.1 open items

memory context cleanup
TOAST
WAL
fmgr redesign
encrypt pg_shadow passwords
redesign pg_hba.conf password file option
new location for config files

+ new ACL? (please :-)

BTW. --- really cool list.

Updated TODO.

#20 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Karel Zak (#2)
Re: Big 7.1 open items

+ new ACL? (please :-)

Updated TODO.

#21 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Peter Eisentraut (#5)
Re: Big 7.1 open items


On Tue, 13 Jun 2000, Bruce Momjian wrote:

Here is the list I have gotten of open 7.1 items:

bit type
inheritance
drop column
vacuum index speed
cached query plans
memory context cleanup
TOAST
WAL
fmgr redesign
encrypt pg_shadow passwords
redesign pg_hba.conf password file option

Any details?

I would like to remove our pg_passwd script that allows
username/passwords to be specified in a file, change that file to lists
of users, or allow lists of users in pg_hba.conf.

new location for config files

Are you referring to pushing internal files to `$PGDATA/global'?

Yes.

#22 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: The Hermit Hacker (#9)
Re: Big 7.1 open items

that one works ...

On Tue, 13 Jun 2000, Vince Vielhaber wrote:

On Tue, 13 Jun 2000, The Hermit Hacker wrote:

On Tue, 13 Jun 2000, Sergio A. Kessler wrote:

On Tue, 13 Jun 2000 05:05:53 -0400 (EDT), Bruce Momjian <pgman@candle.pha.pa.us> wrote:

[...]

new location for config files

can I suggest /etc/postgresql ?

you can ... but everything related to postgresql has always been designed
not to require any special permissions to install, and /etc/postgresql
would definitely require root access to install :(

~postgres/etc ??

Remember, that file has to be specific to each data tree, so it has to
be under /data.

#23 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#10)
Re: Big 7.1 open items

Vince Vielhaber <vev@michvhf.com> writes:

encrypt pg_shadow passwords

This will be for 7.1? For some reason I thought it was being pushed
off to 7.2.

I don't know of anything that would force delaying it --- it's not
dependent on querytree redesign, for example. The real question is,
do we have anyone who's committed to do the work? I heard a lot of
discussion but I didn't hear anyone taking responsibility for it...

Agreed. No reason not to be in 7.1.

#24 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#16)
Re: Big 7.1 open items

I just kept your e-mails. I will make a TODO.detail mailbox with them.

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Here is the list I have gotten of open 7.1 items:

There were a whole bunch of issues about the type system --- automatic
coercion rules, default type selection for both numeric and string
literals, etc. Not sure how to describe this in five words or less...

regards, tom lane

#25 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Kaare Rasmussen (#17)
Re: Big 7.1 open items

Here is the list I have gotten of open 7.1 items:

I thought that someone was working on:
outer joins
better views (or rewriting the rules system; not sure what the direction was)
better SQL92 compliance
Also, I think that at some time there was discussion about a better interface
for procedures, enabling them to work on several tuples. I may be wrong, though.

But if all, or just most, of the items on your list are finished, it ought
to be an 8.0 release :-)

Most of these are planned for 7.2.

#26 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Thomas Lockhart (#18)
Re: Big 7.1 open items

Added to TODO.

Since there are several people interested in contributing, we should
list:

Support multiple simultaneous character sets, per SQL92

- Thomas

#27 Michael Robinson
robinson@netrinsics.com
In reply to: Bruce Momjian (#26)
Re: Big 7.1 open items

While people are working on that, they might want to add some sanity checking
to the multibyte character decoders. Currently they fail to check for
"illegal" character sequences (i.e., sequences with no valid multibyte mapping)
and fail to do anything reasonable with them (such as returning an error or
silently dropping the offending characters), instead just returning random
garbage and crashing the backend.
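[Editor's sketch of the kind of check being asked for, using UTF-8 as the example encoding. The function name and error handling are illustrative, not the backend's actual API:]

```python
# Reject byte sequences with no valid multibyte mapping instead of
# passing garbage through to the rest of the system.

def validate_multibyte(data: bytes) -> str:
    try:
        # Strict decoding raises on any illegal sequence.
        return data.decode("utf-8")
    except UnicodeDecodeError as e:
        # Report the offending position rather than return random bytes.
        raise ValueError(f"illegal multibyte sequence at byte {e.start}")

ok = validate_multibyte("résumé".encode("utf-8"))

try:
    validate_multibyte(b"\xc3\x28")   # 0x28 is not a valid continuation byte
    bad_rejected = False
except ValueError:
    bad_rejected = True
```

The backend would need one such validator per supported encoding; the point is only that invalid input produces a defined error instead of garbage.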

Last time, this failed to get onto the TODO list because Bruce wanted
more than one person to verify that it was an issue. If several people
are going to work on the NATIONAL CHARACTER stuff, maybe they could look
into this issue, too.

-Michael Robinson


Added to TODO.

Since there are several people interested in contributing, we should
list:

Support multiple simultaneous character sets, per SQL92

#28 Oliver Elphick
olly@lfix.co.uk
In reply to: Peter Eisentraut (#5)
Re: Big 7.1 open items

On Tue, 13 Jun 2000, Bruce Momjian wrote:

Here is the list I have gotten of open 7.1 items:

Rolling back a transaction after dropping a table creates a corrupted
database. (Yes, I know it warns you not to do that, but users are
fallible and sometimes just plain stupid.) Although the system catalog
entries are rolled back, the file on disk is permanently destroyed.

I suggest that DROP TABLE in a transaction should not be allowed.

--
Oliver Elphick Oliver.Elphick@lfix.co.uk
Isle of Wight http://www.lfix.co.uk/oliver
PGP: 1024R/32B8FAA1: 97 EA 1D 47 72 3F 28 47 6B 7E 39 CC 56 E4 C1 47
GPG: 1024D/3E1D0C1C: CA12 09E0 E8D5 8870 5839 932A 614D 4C34 3E1D 0C1C
========================================
"I beseech you therefore, brethren, by the mercies of
God, that ye present your bodies a living sacrifice,
holy, acceptable unto God, which is your reasonable
service." Romans 12:1

#29 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Michael Robinson (#27)
Re: Re: Big 7.1 open items

While people are working on that, they might want to add some sanity checking
to the multibyte character decoders. Currently they fail to check for
"illegal" character sequences (i.e., sequences with no valid multibyte mapping)
and fail to do anything reasonable with them (such as returning an error or
silently dropping the offending characters), instead just returning random
garbage and crashing the backend.

Last time, this failed to get onto the TODO list because Bruce wanted
more than one person to verify that it was an issue. If several people
are going to work on the NATIONAL CHARACTER stuff, maybe they could look
into this issue, too.

The issue is that some people felt we shouldn't be performing such
checks, and some did.

#30 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Bruce Momjian (#29)
Re: Big 7.1 open items


On Tue, 13 Jun 2000, Karel Zak wrote:

+ new ACL? (please :-)

Not if we're shipping in August. :(

I hear you.
#31 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Oliver Elphick (#28)
Re: Big 7.1 open items

"Oliver Elphick" <olly@lfix.co.uk> writes:

I suggest that DROP TABLE in a transaction should not be allowed.

I had actually made it do that for a short time early this year,
and was shouted down. On reflection I have to agree; it's too useful
to be able to do

begin;
drop table foo;
create table foo(new schema);
...
end;

You do indeed lose big if you suffer an error partway through, but
the answer to that is to fix our file naming conventions so that we
can support rollback of drop table.

Also note the complaints we've been getting about CREATE USER not
working inside a transaction block. That is a case where someone
(Peter IIRC) took the more hard-line approach of emitting an error
instead of a warning. I think it was not the right choice to make.

regards, tom lane

#32 Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#31)
Re: Big 7.1 open items

Tom Lane writes:

Also note the complaints we've been getting about CREATE USER not
working inside a transaction block. That is a case where someone
(Peter IIRC) took the more hard-line approach of emitting an error
instead of a warning. I think it was not the right choice to make.

Probably. Remember that you can claim your lunch any time. :)

In all truth, the problem is that the ODBC driver isn't very flexible
about putting BEGIN/END blocks around things. Perhaps that is also
something to look at.


#33 Jan Wieck
JanWieck@t-online.de
In reply to: Tom Lane (#31)
Re: Big 7.1 open items

Tom Lane wrote:

"Oliver Elphick" <olly@lfix.co.uk> writes:

I suggest that DROP TABLE in a transaction should not be allowed.

I had actually made it do that for a short time early this year,
and was shouted down. On reflection I have to agree; it's too useful
to be able to do

begin;
drop table foo;
create table foo(new schema);
...
end;

You do indeed lose big if you suffer an error partway through, but
the answer to that is to fix our file naming conventions so that we
can support rollback of drop table.

This belongs IMHO to the discussion about keeping separate what is
separate (having indices/toast-relations/etc. in separate
directories and whatnot).

I've never been really happy with the file naming
conventions. The requirement that a filesystem entry have the same
name as the DB object associated with it isn't right.
I know, some people love to be able to easily identify the
files with ls(1). OTOH, what is that good for?

Well, someone can easily see how big the disk footprint of
his data is. Wow - what information. Anything else?

Why not change the naming to be something like this:

<dbroot>/catalog_tables/pg_...
<dbroot>/catalog_index/pg_...
<dbroot>/user_tables/oid_...
<dbroot>/user_index/oid_...
<dbroot>/temp_tables/oid_...
<dbroot>/temp_index/oid_...
<dbroot>/toast_tables/oid_...
<dbroot>/toast_index/oid_...
<dbroot>/whatnot_???/...

This way, it would be much easier to separate all the
different object types onto different physical media. We would
lose some transparency, but I've always wondered what
people USE that for (except just wanting to know). For
convenience we could implement another little utility that
tells the object size, like

DESCRIBE TABLE/VIEW/whatnot <object-name>

which returns the physical location and storage details of the
object. And psql could use it to print this info additionally
on the \d commands. If that gives unprivileged users access to
this info, so be it; it's not a security issue IMHO.

The subdirectory an object goes into has to be controlled by
the relkind. So we need to tidy that up a little too. I think
it's worth it.

The object's storage location (the bare file) would now
contain the OID. That way we avoid naming conflicts for temp
tables, naming conflicts during DROP/CREATE in a transaction,
and the like.
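[Editor's sketch of the proposed layout: the relkind chooses the subdirectory and the OID, not the relation name, names the file. Directory names follow the list above; the relkind codes and function are otherwise illustrative:]

```python
# Map a relation's kind and OID to its on-disk path under <dbroot>.
import os

RELKIND_DIR = {
    "c": "catalog_tables",   # system catalogs keep their pg_ names
    "r": "user_tables",
    "i": "user_index",
    "t": "toast_tables",
}

def relation_path(dbroot: str, relkind: str, oid: int) -> str:
    # The filename carries only the OID, so renaming or re-creating
    # the relation never has to rename the file.
    return os.path.join(dbroot, RELKIND_DIR[relkind], f"oid_{oid}")

p = relation_path("/dbroot", "r", 16384)
```

Because the path no longer encodes the relation name, DROP/CREATE of a same-named table in one transaction touches two distinct files and cannot collide.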

Comments?

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#34 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Jan Wieck (#33)
Re: Big 7.1 open items

Tom Lane wrote:

"Oliver Elphick" <olly@lfix.co.uk> writes:

I suggest that DROP TABLE in a transaction should not be allowed.

I had actually made it do that for a short time early this year,
and was shouted down. On reflection I have to agree; it's too useful
to be able to do

begin;
drop table foo;
create table foo(new schema);
...
end;

You do indeed lose big if you suffer an error partway through, but
the answer to that is to fix our file naming conventions so that we
can support rollback of drop table.

Belongs IMHO to the discussion to keep separate what is
separate (having indices/toast-relations/etc. in separate
directories and whatnot).

I've never been really happy with the file naming
conventions. The requirement that a filesystem entry have the same
name as the DB object associated with it isn't right.
I know, some people love to be able to easily identify the
files with ls(1). OTOH, what is that good for?

Well, I have no problem just appending some serial number to the end of
our existing names. That serves both purposes, no? It seems Vadim is
going to have a new storage manager in 7.2 anyway.

If/when we lose file name/object mapping, we will have to write
command-line utilities to report the mappings so people can do
administration properly. It certainly makes it hard for administrators.

Well, someone can easily see how big the disk footprint of
his data is. Wow - what information. Anything else?

Why not change the naming to be something like this:

<dbroot>/catalog_tables/pg_...
<dbroot>/catalog_index/pg_...
<dbroot>/user_tables/oid_...
<dbroot>/user_index/oid_...
<dbroot>/temp_tables/oid_...
<dbroot>/temp_index/oid_...
<dbroot>/toast_tables/oid_...
<dbroot>/toast_index/oid_...
<dbroot>/whatnot_???/...

This way, it would be much easier to separate all the
different object types onto different physical media. We would
lose some transparency, but I've always wondered what
people USE that for (except just wanting to know). For
convenience we could implement another little utility that
tells the object size, like

Yes, we could do that.

DESCRIBE TABLE/VIEW/whatnot <object-name>

that returns the physical location and storage details of the
object. And psql could use it to print this info additional
on the \d commands. Would give unprivileged users access to
this info, so be it, it's not a security issue IMHO.

You need something that works from the command line, and something that
works if PostgreSQL is not running. How would you restore one file from
a tape? I guess you could bring back the whole thing, then do the
query, and move the proper table file back in, but that is a pain.

#35 Don Baccus
dhogaza@pacifier.com
In reply to: Bruce Momjian (#34)
Re: Big 7.1 open items

At 07:13 PM 6/14/00 -0400, Bruce Momjian wrote:

This way, it would be much easier to separate all the
different object types to different physical media. We would
loose some transparency, but I've allways wondered what
people USE that for (except for just wanna know). For
convinience we could implement another little utility that
tells the object size like

Yes, we could do that.

It's a poor man's substitute for a proper create tablespace on
storage 'filesystem'-style DDL statement, but it's a step in
the right direction.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.

#36 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Wieck (#33)
Re: Big 7.1 open items

JanWieck@t-online.de (Jan Wieck) writes:

I've never been really happy with the file naming
conventions. The requirement that a filesystem entry have the same
name as the DB object associated with it isn't right.
I know, some people love to be able to easily identify the
files with ls(1). OTOH, what is that good for?

I agree with Jan on this: let's just change the file names over to
be OIDs. Then we can have rollbackable DROP and RENAME TABLE easily.
Naming the files after the logical names of the tables is nice if it
doesn't cost anything, but it is *not* worth the trouble to preserve
a relationship between filename and tablename when it is costing us.
And it's costing us big time. That single feature is hurting us on
functionality, robustness, and portability, and for what benefit?
Not nearly enough. It's time to just let go of it.

Why not change the naming to be something like this:

<dbroot>/catalog_tables/pg_...
<dbroot>/catalog_index/pg_...
<dbroot>/user_tables/oid_...
<dbroot>/user_index/oid_...
<dbroot>/temp_tables/oid_...
<dbroot>/temp_index/oid_...
<dbroot>/toast_tables/oid_...
<dbroot>/toast_index/oid_...
<dbroot>/whatnot_???/...

I don't see a lot of value in that. Better to do something like
tablespaces:

<dbroot>/<oidoftablespace>/<oidofobject>

regards, tom lane

#37 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#34)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

You need something that works from the command line, and something that
works if PostgreSQL is not running. How would you restore one file from
a tape?

"Restore one file from a tape"? How are you going to do that anyway?
You can't save and restore portions of a database like that, because
of transaction commit status problems. To restore table X correctly,
you'd have to restore pg_log as well, and then your other tables are
hosed --- unless you also restore all of them from the backup. Only
a complete database restore from tape would work, and for that you
don't need to tell which file is which. So the above argument is a
red herring.

I realize it's nice to be able to tell which table file is which by
eyeball, but the price we are paying for that small convenience is
just too high. Give that up, and we can have rollbackable DROP and
RENAME now (I'll personally commit to making it happen for 7.1).
Continue to insist on it, and I don't think we'll *ever* have those
features in a really robust form. It's just not possible to do
multiple file renames atomically.

regards, tom lane

#38Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#37)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

You need something that works from the command line, and something that
works if PostgreSQL is not running. How would you restore one file from
a tape.

"Restore one file from a tape"? How are you going to do that anyway?
You can't save and restore portions of a database like that, because
of transaction commit status problems. To restore table X correctly,
you'd have to restore pg_log as well, and then your other tables are
hosed --- unless you also restore all of them from the backup. Only
a complete database restore from tape would work, and for that you
don't need to tell which file is which. So the above argument is a
red herring.

I realize it's nice to be able to tell which table file is which by
eyeball, but the price we are paying for that small convenience is
just too high. Give that up, and we can have rollbackable DROP and
RENAME now (I'll personally commit to making it happen for 7.1).
Continue to insist on it, and I don't think we'll *ever* have those
features in a really robust form. It's just not possible to do
multiple file renames atomically.

OK, I am flexible. (Yea, right.) :-)

But seriously, let me give some background. I used Ingres, which used
the VMS file system but named table files with strange sequential
numbers like AAAF324. When someone deleted a table, or we were looking
at what tables were using disk space, it was impossible to find the
Ingres table names that went with the files. There was a system table
that showed the mapping, but it was poorly documented, and if you
deleted the table, there was no way to look at the tape to find out
which file to restore.

As far as pg_log, you certainly would not expect to get any information
back from the time of the backup table to current, so the current pg_log
would be just fine.

Basically, I guess we have to do it, but we have to print proper
error messages in the cases where the backend currently just prints the
file name. Also, we have to replace the 'ls -l' command with something
that will be meaningful.

Right now, we use 'ps' with args to display backend information, and ls
-l to show disk information. We are going to lose that here.
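A replacement for 'ls -l' along these lines need not be large. A hedged sketch follows: the directory layout, OIDs, and the oidmap file are all invented here; in practice the map would be dumped from the catalogs with something like SELECT oid, relname FROM pg_class:

```shell
#!/bin/sh
set -e
# Fake an OID-named data directory (everything here is illustrative).
mkdir -p demo/base/1
printf 'some heap data' > demo/base/1/16384

# oid-to-relname map, as it might be dumped from pg_class before a backup.
cat > demo/oidmap <<'EOF'
16384 mytable
EOF

# Annotate each file with its table name and size - a poor man's 'ls -l'.
for f in demo/base/1/*; do
    oid=$(basename "$f")
    name=$(awk -v o="$oid" '$1 == o { print $2 }' demo/oidmap)
    size=$(wc -c < "$f" | tr -d ' ')
    echo "$oid ${name:-?} $size bytes"    # prints: 16384 mytable 14 bytes
done
```

Dumping such a map nightly, right before the backup, would also answer the tape-restore complaint: the listing on the tape tells you which OID file was which table.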

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#38)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

But seriously, let me give some background. I used Ingres, that used
the VMS file system, but used strange sequential AAAF324 numbers for
tables. When someone deleted a table, or we were looking at what tables
were using disk space, it was impossible to find the Ingres table names
that went with the file. There was a system table that showed it, but
it was poorly documented, and if you deleted the table, there was no way
to look on the tape to find out which file to restore.

Fair enough, but it seems to me that the answer is to expend some effort
on system admin support tools. We could do a lot in that line with less
effort than trying to make a fundamentally mismatched filesystem
representation do what we need.

regards, tom lane

#40Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#39)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

But seriously, let me give some background. I used Ingres, that used
the VMS file system, but used strange sequential AAAF324 numbers for
tables. When someone deleted a table, or we were looking at what tables
were using disk space, it was impossible to find the Ingres table names
that went with the file. There was a system table that showed it, but
it was poorly documented, and if you deleted the table, there was no way
to look on the tape to find out which file to restore.

Fair enough, but it seems to me that the answer is to expend some effort
on system admin support tools. We could do a lot in that line with less
effort than trying to make a fundamentally mismatched filesystem
representation do what we need.

That was my point --- in doing this change, we are taking on more
TODO items that may detract from our main TODO items. I am also
concerned that the filename/tablename mapping is supported by so many
Unix tools like ls, lsof/fstat, and tar that we could end up
needing to write tons of utilities to let administrators do what
they can so easily do now.

Even gdb shows us the filename/tablename in backtraces. We are never
going to be able to reproduce that. I guess I didn't want to bite off
that much work until we had a _convincing_ need. I guess I don't
consider table schema commands inside transactions and such to be as big
an item as the utility features we will need to build.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#41Don Baccus
dhogaza@pacifier.com
In reply to: Bruce Momjian (#38)
Re: Big 7.1 open items

At 10:28 PM 6/14/00 -0400, Bruce Momjian wrote:

As far as pg_log, you certainly would not expect to get any information
back from the time of the backup table to current, so the current pg_log
would be just fine.

In reality, very few people are going to be interested in restoring
a table in a way that breaks referential integrity and other
normal assumptions about what exists in the database. The reality
is that most people are going to engage in a little time travel
to a past, consistent backup rather than do as you suggest.

This is going to be more and more true as Postgres gains more and
more acceptance in (no offense intended) the real world.

Right now, we use 'ps' with args to display backend information, and ls
-l to show disk information. We are going to lose that here.

Dependence on "ls -l" is, IMO, a very weak argument.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.

#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#40)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

That was my point --- that in doing this change, we are taking on more
TODO items, that may detract from our main TODO items.

True, but they are also TODO items that could be handled by people other
than the inner circle of key developers. The actual rejiggering of
table-to-filename mapping is going to have to be done by one of the
small number of people who are fully up to speed on backend internals.
But we've got a lot more folks who would be able (and, hopefully,
willing) to design and code whatever tools are needed to make the
dbadmin's job easier in the face of the new filesystem layout. I'd
rather not expend a lot of core time to avoid needing those tools,
especially when I feel the old approach is fatally flawed anyway.

Even gdb shows us the filename/tablename in backtraces. We are never
going to be able to reproduce that.

Backtraces from *what*, exactly? 99% of the backend is still going
to be dealing with the same data as ever. It might be that poking
around in fd.c will be a little harder, but considering that fd.c
doesn't really know or care what the files it's manipulating are
anyway, I'm not convinced that this is a real issue.

I guess I don't consider table schema commands inside transactions and
such to be as big an items as the utility features we will need to
build.

You've *got* to be kidding. We're constantly seeing complaints about
the fact that rolling back DROP or RENAME TABLE fails --- and worse,
leaves the table in a corrupted/inconsistent state. As far as I can
tell, that's one of the worst robustness problems we've got left to
fix. This is a big deal IMHO, and I want it to be fixed and fixed
right. I don't see how to fix it right if we try to keep physical
filenames tied to logical tablenames.

Moreover, that restriction will continue to hurt us if we try to
preserve it while implementing tablespaces, ANSI schemas, etc.

regards, tom lane

#43Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#42)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

That was my point --- that in doing this change, we are taking on more
TODO items, that may detract from our main TODO items.

True, but they are also TODO items that could be handled by people other
than the inner circle of key developers. The actual rejiggering of
table-to-filename mapping is going to have to be done by one of the
small number of people who are fully up to speed on backend internals.
But we've got a lot more folks who would be able (and, hopefully,
willing) to design and code whatever tools are needed to make the
dbadmin's job easier in the face of the new filesystem layout. I'd
rather not expend a lot of core time to avoid needing those tools,
especially when I feel the old approach is fatally flawed anyway.

Yes, it is clearly fatally flawed. I agree.

Even gdb shows us the filename/tablename in backtraces. We are never
going to be able to reproduce that.

Backtraces from *what*, exactly? 99% of the backend is still going
to be dealing with the same data as ever. It might be that poking
around in fd.c will be a little harder, but considering that fd.c
doesn't really know or care what the files it's manipulating are
anyway, I'm not convinced that this is a real issue.

I was just throwing gdb out as an example. The bigger ones are ls,
lsof/fstat, and tar.

I guess I don't consider table schema commands inside transactions and
such to be as big an items as the utility features we will need to
build.

You've *got* to be kidding. We're constantly seeing complaints about
the fact that rolling back DROP or RENAME TABLE fails --- and worse,
leaves the table in a corrupted/inconsistent state. As far as I can
tell, that's one of the worst robustness problems we've got left to
fix. This is a big deal IMHO, and I want it to be fixed and fixed
right. I don't see how to fix it right if we try to keep physical
filenames tied to logical tablenames.

Moreover, that restriction will continue to hurt us if we try to
preserve it while implementing tablespaces, ANSI schemas, etc.

Well, we did have someone do a test implementation of oid file names,
and their report was that it looked pretty ugly. However, if people are
convinced it has to be done, we can get started. I guess I was waiting
for Vadim's storage manager, where the whole idea of separate files is
going to go away anyway, I suspect. We would then have to re-write all
our admin tools for the new format.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#44Noname
JanWieck@t-online.de
In reply to: Bruce Momjian (#34)
Re: Big 7.1 open items

Bruce Momjian wrote:

DESCRIBE TABLE/VIEW/whatnot <object-name>

that returns the physical location and storage details of the
object. And psql could use it to print this info additional
on the \d commands. Would give unprivileged users access to
this info, so be it, it's not a security issue IMHO.

You need something that works from the command line, and something that
works if PostgreSQL is not running. How would you restore one file from
a tape. I guess you could bring back the whole thing, then do the
query, and move the proper table file back in, but that is a pain.

I think you've messed up some basics of PG here.

It's totally useless to restore single files of a PostgreSQL
database. You could either put back anything below ./data, or
nothing - the reason is pg_log.

You don't need something that works if PostgreSQL is not
running. You cannot restore ONE file from a tape! You can
restore a PostgreSQL instance (only a complete one - not a
single DB, nor a single table or any other object). While
your backup is writing to the tape, any number of backends
could concurrently modify single blocks of the heap, its
indices and pg_log. So what does the tape contain then?

I'd like to ask you: are you sure the backups you're making
are worth the power consumption of the tape drive? You're
talking about restoring a file - and should be aware of the
fact that any file-based backup will never be able to get a
consistent snapshot of the database the way pg_dump can.

As long as you don't take the postmaster down during the
entire saving of ./data, you aren't in a safe position. And
the only safe RESTORE is to restore ./data completely or
nothing. It's not even (easily) possible to initdb and
restore a single DB from tape (it is, but requires some deep
knowledge and more than just restoring some files from tape).

YOU REALLY DON'T NEED ANY FILENAMES IN THERE!

The more I think about it, the more I feel these file names,
easily associated with the objects they represent, are more
dangerous than useful in practice. Maybe we should obfuscate
the entire ./data like Oracle does with its tablespace
files. Just that our tablespaces will be directories
containing totally cryptic named files.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#45Noname
JanWieck@t-online.de
In reply to: Tom Lane (#36)
Re: Big 7.1 open items

Tom Lane wrote:

JanWieck@t-online.de (Jan Wieck) writes:

Why not change the naming to something like this:

<dbroot>/catalog_tables/pg_...
<dbroot>/catalog_index/pg_...
<dbroot>/user_tables/oid_...
<dbroot>/user_index/oid_...
<dbroot>/temp_tables/oid_...
<dbroot>/temp_index/oid_...
<dbroot>/toast_tables/oid_...
<dbroot>/toast_index/oid_...
<dbroot>/whatnot_???/...

I don't see a lot of value in that. Better to do something like
tablespaces:

<dbroot>/<oidoftablespace>/<oidofobject>

*Slap* - yes!

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#46Michael Robinson
robinson@netrinsics.com
In reply to: Bruce Momjian (#29)
Re: Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

The issue is that some people felt we shouldn't be performing such
checks, and some did.

Well, more precisely, the issue was stalemated at "one person felt we
should perform such checks" and "one person (who, incidentally, wrote the
code) felt we shouldn't".

Obviously, as it stands, the issue is going nowhere.

I was just hoping to encourage more people to examine the problem, so that
we might get a consensus one way or the other.

-Michael Robinson

#47Ross J. Reedstrom
reedstrm@rice.edu
In reply to: Bruce Momjian (#43)
Re: Big 7.1 open items

On Wed, Jun 14, 2000 at 11:21:15PM -0400, Bruce Momjian wrote:

Well, we did have someone do a test implementation of oid file names,
and their report was that it looked pretty ugly.

That someone would be me. Did my mails from this morning fall into a black
hole? I've got a patch that does either oid filenames or relname_<oid>,
take your pick. It doesn't do tablespaces, just leaves the files where
they are. To do relname_<oid>, I add a relphysname field to pg_class.

I'll update it to current and throw it at the PATCHES list this weekend,
unless someone more central wants to do tablespaces first. I tried
out rolling back ALTER TABLE RENAME. Works fine. Biggest problem
with it is that I played silly buggers with the relcache for no good
reason. Hiroshi stripped that out and said it works fine otherwise. I
also haven't touched DROP TABLE yet. The physical file would be deleted
at transaction commit time, then? Hmm, where's the 'things to do at
commit' queue?

convinced it has to be done, we can get started. I guess I was waiting
for Vadim's storage manager, where the whole idea of separate files is
going to go away anyway, I suspect. We would then have to re-write all
our admin tools for the new format.

Any strong objections to the mixed relname_oid solution? It gets us
everything oids does, and still lets Bruce use 'ls -l' to find the big
tables, putting off writing any admin tools that'll need to be rewritten,
anyway.

Ross
--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005

#48Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Bruce Momjian (#38)
Re: Big 7.1 open items

But seriously, let me give some background. I used Ingres, that used
the VMS file system, but used strange sequential AAAF324 numbers for
tables. When someone deleted a table, or we were looking at what
tables were using disk space, it was impossible to find the Ingres
table names that went with the file. There was a system table that
showed it, but it was poorly documented, and if you deleted the table,
there was no way to look on the tape to find out which file to
restore.

I had the same experience, but let's put the blame where it belongs: it
wasn't the filename's fault, it was poor design and support from the
Ingres company.

- Thomas

#49Chris Bitmead
chrisb@nimrod.itg.telstra.com.au
In reply to: Tom Lane (#42)
Re: Big 7.1 open items

"Ross J. Reedstrom" wrote:

Any strong objections to the mixed relname_oid solution? It gets us
everything oids does, and still lets Bruce use 'ls -l' to find the big
tables, putting off writing any admin tools that'll need to be rewritten,
anyway.

Doesn't relname_oid defeat the purpose of oid file names, which is that
they don't change when the table is renamed? Wasn't it going to be oids
with a tool to create a symlink of relname -> oid ?

#50Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Michael Robinson (#46)
Re: Re: Big 7.1 open items

The issue is that some people felt we shouldn't be performing such
checks, and some did.

Well, more precisely, the issue was stalemated at "one person felt we
should perform such checks" and "one person (who, incidentally, wrote
the code) felt we shouldn't".
I was just hoping to encourage more people to examine the problem, so
that we might get a consensus one way or the other.

I hope that the issue is clearer once we have a trial implementation to
play with.

- Thomas

#51Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ross J. Reedstrom (#47)
Re: Big 7.1 open items

"Ross J. Reedstrom" <reedstrm@rice.edu> writes:

Any strong objections to the mixed relname_oid solution?

Yes!

You cannot make it work reliably unless the relname part is the original
relname and does not track ALTER TABLE RENAME. IMHO having an obsolete
relname in the filename is worse than not having the relname at all;
it's a recipe for confusion, it means you still need admin tools to tell
which end is really up, and what's worse is you might think you don't.

Furthermore it requires an additional column in pg_class to keep track
of the original relname, which is a waste of space and effort.

It also creates a portability risk, or at least fails to remove one,
since you are critically dependent on the assumption that the OS
supports long filenames --- on a filesystem that truncates names to less
than about 45 characters you're in very deep trouble. An OID-only
approach still works on traditional 14-char-filename Unix filesystems
(it'd mostly even work on DOS 8+3, though I doubt we care about that).

Finally, one of the reasons I want to go to filenames based only on OID
is that that'll make life easier for mdblindwrt. Original relname + OID
doesn't help, in fact it makes life harder (more shmem space needed to
keep track of the filename for each buffer).

Can we *PLEASE JUST LET GO* of this bad idea? No relname in the
filename. Period.

regards, tom lane

#52Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#43)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Well, we did have someone do a test implementation of oid file names,
and their report was that it looked pretty ugly. However, if people are
convinced it has to be done, we can get started. I guess I was waiting
for Vadim's storage manager, where the whole idea of separate files is
going to go away anyway, I suspect. We would then have to re-write all
our admin tools for the new format.

I seem to recall him saying that he wanted to go to filename == OID
just like I'm suggesting. But I agree we probably ought to hold off
doing anything until he gets back from Russia and can let us know
whether that's still his plan. If he is planning one-huge-file or
something like that, we might as well let these issues go unfixed
for one more release cycle.

regards, tom lane

#53Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Tom Lane (#52)
AW: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

You need something that works from the command line, and something that
works if PostgreSQL is not running. How would you restore one file from
a tape.

"Restore one file from a tape"? How are you going to do that anyway?
You can't save and restore portions of a database like that, because
of transaction commit status problems. To restore table X correctly,
you'd have to restore pg_log as well, and then your other tables are
hosed --- unless you also restore all of them from the backup. Only
a complete database restore from tape would work, and for that you
don't need to tell which file is which. So the above argument is a
red herring.

From what I know it is possible to simply restore one table file,
since pg_log keeps all transaction ids. Of course it cannot guarantee
integrity, and it does not work if the table was altered.

I realize it's nice to be able to tell which table file is which by
eyeball, but the price we are paying for that small convenience is
just too high. Give that up, and we can have rollbackable DROP and
RENAME now (I'll personally commit to making it happen for 7.1).
Continue to insist on it, and I don't think we'll *ever* have those
features in a really robust form. It's just not possible to do
multiple file renames atomically.

In the last proposal, Bruce and I had it all laid out for tabname + oid,
with no overhead in the normal situation and little overhead if a rename
table crashed or was not rolled back or committed properly - which IMHO
had all the advantages combined.

Andreas

#54Karel Zak
zakkr@zf.jcu.cz
In reply to: Zeugswetter Andreas SB (#53)
Re: Big 7.1 open items

On Wed, 14 Jun 2000, Peter Eisentraut wrote:

On Tue, 13 Jun 2000, Karel Zak wrote:

+ new ACL? (please :-)

Not if we're shipping in August. :(

I understand you. I said it as dream :-)

BTW --- which ACL design do you think would be good?

1/ your original idea with one line per privilege in pg_privilege
(IMHO it's a good idea).

2/ several privileges in one line.

Karel

#55Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Karel Zak (#54)
AW: Big 7.1 open items

In reality, very few people are going to be interested in restoring
a table in a way that breaks referential integrity and other
normal assumptions about what exists in the database.

This is not true. In my DBA history it would have saved me man-weeks
of work if an easy and efficient restore of a single table from backup
had been available in Informix and Oracle.
We always had to restore most of the whole system to another machine only
to get back some table data that would then be manually re-added
to the production system.
A restore of one table to a different/new table name would have been
very convenient, and this is currently possible in PostgreSQL.
(Create a new table with the same schema, then replace the new table's
data file with the file from backup.)
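The trick Andreas describes can be sketched as a plain file swap. Everything below is hypothetical (directory names, table name, and the commented-out psql steps), and it relies on the current fact that the data file carries the table's name:

```shell
#!/bin/sh
set -e
# Stand-ins for the real locations; invented for illustration.
mkdir -p tape data/mydb
printf 'rows as of last backup' > tape/orders    # table file pulled off the tape

# 1. With the postmaster running, create an empty table with the same schema:
#      psql mydb -c 'CREATE TABLE orders_restored ( ... same columns ... )'
#    which leaves an empty file named after the new table:
: > data/mydb/orders_restored

# 2. Stop the postmaster, then swap in the file from the backup:
cp tape/orders data/mydb/orders_restored

# 3. Restart the postmaster and copy over whatever rows you need.
```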

The reality
is that most people are going to engage in a little time travel
to a past, consistent backup rather than do as you suggest.

No, this is what is done most of the time, but it is very inconvenient
to tell people that they lose all work from the past days, so it is usually
done as I noted above if possible. We once had a situation where all data
was deleted from a table, but the problem was only noticed 3 weeks later.

This is going to be more and more true as Postgres gains more and
more acceptance in (no offense intended) the real world.

Right now, we use 'ps' with args to display backend information, and ls
-l to show disk information. We are going to lose that here.

Dependence on "ls -l" is, IMO, a very weak argument.

In normal situations where everything works I agree, it is the
error situations where it really helps if you see what data is where.
debugging, lsof, Bruce already named them.

Andreas

#56Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Zeugswetter Andreas SB (#55)
AW: Big 7.1 open items

"Ross J. Reedstrom" <reedstrm@rice.edu> writes:

Any strong objections to the mixed relname_oid solution?

Yes!

You cannot make it work reliably unless the relname part is the original
relname and does not track ALTER TABLE RENAME.

It does, or should at least. The only problem case is where the db
crashes during the alter or the commit/rollback. This could be fixed up
by the first open that fails to find the file, by vacuum, or by some
other utility.

IMHO having an obsolete relname in the filename is worse than not having
the relname at all; it's a recipe for confusion, it means you still need
admin tools to tell which end is really up, and what's worse is you might
think you don't.

Furthermore it requires an additional column in pg_class to keep track
of the original relname, which is a waste of space and effort.

it does not.

Finally, one of the reasons I want to go to filenames based only on OID
is that that'll make life easier for mdblindwrt. Original relname + OID
doesn't help, in fact it makes life harder (more shmem space needed to
keep track of the filename for each buffer).

I do not see this. The filename is constructed from relname+oid. If it
is not found, do a directory scan for *_<OID>.dat; if found --> rename.
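That lookup-with-fallback could be sketched like this; the relname_<oid>.dat naming and the directory are assumptions taken from the proposal, not an existing interface:

```shell
#!/bin/sh
set -e
# Hypothetical data directory with a file left under its pre-rename name.
mkdir -p scan_demo
: > scan_demo/foo_123.dat

find_relation_file() {    # usage: find_relation_file <dir> <relname> <oid>
    if [ -f "$1/$2_$3.dat" ]; then
        echo "$1/$2_$3.dat"
        return
    fi
    # Crashed rename? The _<oid> suffix is unique, so scan for it.
    for f in "$1"/*_"$3".dat; do
        [ -f "$f" ] && echo "$f" && return
    done
    return 1
}

find_relation_file scan_demo foo 123        # prints: scan_demo/foo_123.dat
find_relation_file scan_demo renamed 123    # finds the stale name: scan_demo/foo_123.dat
```

Because the _<oid> part is unique regardless of the relname prefix, the scan finds at most one file, and the caller could then rename it back into line.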

Andreas

#57Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Zeugswetter Andreas SB (#56)
AW: Big 7.1 open items

It's just not possible to do
multiple file renames atomically.

This is not necessary, since *_<OID> is unique regardless of relname prefix.

Andreas

#58Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Michael Robinson (#27)
Re: Re: Big 7.1 open items

While people are working on that, they might want to add some sanity checking
to the multibyte character decoders. Currently they fail to check for
"illegal" character sequences (i.e. sequences with no valid multibyte mapping),
and fail to do something reasonable (like return an error, silently drop the
offending characters, or anything else besides just returning random garbage
and crashing the backend).

Hum.. I thought Michael Robinson was the one who was against the idea
of rejecting "illegal" character sequences before they are put in the
DB. I like the idea but I haven't had time to do it (however, I'm not
sure I would want to do it for EUC-CN, since he dislikes the code I
write).

Bruce, I would like to see the following in the TODO. I would also
like to hear from Thomas and Peter, or whoever is interested in
implementing the NATIONAL CHARACTER stuff, whether these are reasonable.

o Don't accept character sequences that are not valid in their charset
(signaling an ERROR seems appropriate IMHO)

o Make PostgreSQL more multibyte-aware (for example, the TRIM function and
the NAME data type)

o Regard the n of CHAR(n)/VARCHAR(n) as the number of characters, rather
than the number of bytes
--
Tatsuo Ishii

#59The Hermit Hacker
scrappy@hub.org
In reply to: Noname (#33)
Re: Big 7.1 open items

On Wed, 14 Jun 2000, Jan Wieck wrote:

Why not change the naming to something like this:

<dbroot>/catalog_tables/pg_...
<dbroot>/catalog_index/pg_...
<dbroot>/user_tables/oid_...
<dbroot>/user_index/oid_...
<dbroot>/temp_tables/oid_...
<dbroot>/temp_index/oid_...
<dbroot>/toast_tables/oid_...
<dbroot>/toast_index/oid_...
<dbroot>/whatnot_???/...

This way, it would be much easier to separate all the
different object types onto different physical media. We would
lose some transparency, but I've always wondered what
people USE that for (except for just wanting to know). For
convenience we could implement another little utility that
tells the object size like

Wow, I've been advocating this one for how many months now? :) You won't
get any arguments from me ...

#60The Hermit Hacker
scrappy@hub.org
In reply to: Bruce Momjian (#43)
Re: Big 7.1 open items

On Wed, 14 Jun 2000, Bruce Momjian wrote:

Backtraces from *what*, exactly? 99% of the backend is still going
to be dealing with the same data as ever. It might be that poking
around in fd.c will be a little harder, but considering that fd.c
doesn't really know or care what the files it's manipulating are
anyway, I'm not convinced that this is a real issue.

I was just throwing gdb out as an example. The bigger ones are ls,
lsof/fstat, and tar.

You've lost me on this one ... if someone does an lsof of the process, it
will still provide them a list of open files ... are you complaining about
the extra step required to translate the file name to a "valid table"?

Oh, one point here ... this whole 'filenaming issue' ... as far as ls is
concerned, at least, only affects the superuser, since he's the only one
that can go 'ls'ing around in the directories ...

And, ummm, how hard would it be to have \d in psql display the "physical
table name" as part of its output?

Slight tangent here:

One thing that I think would be great to add is some sort of:

SELECT db_name, disk_space;

query where a database owner, not the superuser, could see how much disk
space their tables are using up ... possible?
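With a one-directory-per-database layout, the superuser side of that query is little more than summing file sizes. A sketch (the layout and names are invented; per-owner reporting would still need catalog support):

```shell
#!/bin/sh
set -e
# Invented layout: one directory per database under the data root.
mkdir -p usage_demo/mydb
head -c 8192 /dev/zero > usage_demo/mydb/16384
head -c 4096 /dev/zero > usage_demo/mydb/16390

# Report total bytes per database directory.
for db in usage_demo/*/; do
    bytes=$(cat "$db"* | wc -c | tr -d ' ')
    echo "$(basename "$db") $bytes bytes"    # prints: mydb 12288 bytes
done
```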

#61Mark Hollomon
mhh@nortelnetworks.com
In reply to: Tom Lane (#42)
Re: Big 7.1 open items

Ross J. Reedstrom wrote:

Any strong objections to the mixed relname_oid solution? It gets us
everything oids does, and still lets Bruce use 'ls -l' to find the big
tables, putting off writing any admin tools that'll need to be rewritten,
anyway.

I would object to the mixed name.

Consider:

CREATE TABLE FOO ....
ALTER TABLE FOO RENAME TO FOO_OLD;
CREATE TABLE FOO ....

For the same atomicity reason, rename can't change the
name of the files. So, which foo_<oid> is the FOO_OLD
and which is FOO?

In other words, in the presence of rename, putting
relname in the filename is misleading at best.
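Mark's scenario is easy to reproduce on disk. With invented OIDs, after the CREATE / RENAME / CREATE sequence the directory holds two files both claiming to be "foo", and only the catalogs can say which is which:

```shell
#!/bin/sh
set -e
mkdir -p rename_demo
: > rename_demo/foo_1001    # created as FOO, later renamed to FOO_OLD in the catalogs
: > rename_demo/foo_1002    # the second CREATE TABLE FOO
ls rename_demo              # lists foo_1001 and foo_1002 - which one is FOO_OLD?
```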

--

Mark Hollomon
mhh@nortelnetworks.com
ESN 451-9008 (302)454-9008

#62Brian E Gallew
geek+@cmu.edu
In reply to: Tom Lane (#39)
Re: Big 7.1 open items

Then <tgl@sss.pgh.pa.us> spoke up and said:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

But seriously, let me give some background. I used Ingres, that used
the VMS file system, but used strange sequential AAAF324 numbers for
tables. When someone deleted a table, or we were looking at what tables
were using disk space, it was impossible to find the Ingres table names
that went with the file. There was a system table that showed it, but
it was poorly documented, and if you deleted the table, there was no way
to look on the tape to find out which file to restore.

Fair enough, but it seems to me that the answer is to expend some effort
on system admin support tools. We could do a lot in that line with less
effort than trying to make a fundamentally mismatched filesystem
representation do what we need.

We've been an Ingres shop as long as there's been an Ingres. While
we've also had the problem Bruce noticed with table names, we've
*also* used the trivial fix of running a (simple) Report Writer job
each night, immediately before the backup, that lists all of the
database tables/indices and the underlying files.

True, if someone drops/recreates a table twice between backups we
can't find the intermediate file name, but since we also haven't
backed up that filename, this isn't an issue.

Also, the consistency issue is really not as important as you would
think. If you are restoring a table, you want the information in it,
whether or not it's consistent with anything else. I've done hundreds
of table restores (can you say "modify table to heap"?) and never once
has inconsistency been an issue. Oh, yeah, and we don't shut the
database down for this, either. (That last isn't my choice, BTW.)

--
=====================================================================
| JAVA must have been developed in the wilds of West Virginia. |
| After all, why else would it support only single inheritance?? |
=====================================================================
| Finger geek@cmu.edu for my public key. |
=====================================================================

#63Don Baccus
dhogaza@pacifier.com
In reply to: Zeugswetter Andreas SB (#55)
Re: AW: Big 7.1 open items

At 10:04 AM 6/15/00 +0200, Zeugswetter Andreas SB wrote:

In reality, very few people are going to be interested in restoring
a table in a way that breaks referential integrity and other
normal assumptions about what exists in the database.

This is not true. In my DBA history it would have saved me man-weeks
of work if an easy and efficient restore of one single table from backup
had been available in Informix and Oracle.
We always had to restore most of the whole system to another machine only
to get back at some table info that would then be manually re-added
to the production system.

I'm missing something, I guess. You would do a createdb, do a filesystem
copy of pg_log and one file into it, and then read data from the table
without having to restore the other tables in the database?

I'm just curious - when was the last time you restored a Postgres
database in this piecemeal manner, and how often do you do it?

- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.

#64Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Thomas Lockhart (#48)
Re: Big 7.1 open items

But seriously, let me give some background. I used Ingres, that used
the VMS file system, but used strange sequential AAAF324 numbers for
tables. When someone deleted a table, or we were looking at what
tables were using disk space, it was impossible to find the Ingres
table names that went with the file. There was a system table that
showed it, but it was poorly documented, and if you deleted the table,
there was no way to look on the tape to find out which file to
restore.

I had the same experience, but let's put the blame where it belongs: it
wasn't the filename's fault, it was poor design and support from the
Ingres company.

Yes, that certainly was part of the cause. Also, if the PostgreSQL
files are backed up using tar while no database activity is happening,
there is no reason the tar restore will not work. You just create a
table with the same schema, stop the postmaster, have the backup file
replace the newly created table file, and restart the postmaster.

I can't tell you how many times I have said, "Man, whoever did this
Ingres naming schema was an idiot. Do they know how many problems they
caused for us?"

Also, Informix standard engine uses the tablename_oid setup for its
table names, and it works fine. It grabs the first 8 characters of the
table name, and plops a unique number on the end. Works fine for
administrators.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#65Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Zeugswetter Andreas SB (#56)
Re: AW: Big 7.1 open items

Finally, one of the reasons I want to go to filenames based only on OID
is that that'll make life easier for mdblindwrt. Original relname + OID
doesn't help, in fact it makes life harder (more shmem space needed to
keep track of the filename for each buffer).

I do not see this. The filename is constructed from relname+oid;
if not found, do a directory scan for *_<OID>.dat, and if found --> rename.

That is kind of nifty.
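Andreas' fallback could be sketched in C roughly as follows. This is an illustrative assumption, not code from any posted patch: find_relfile, the fixed buffer sizes, and the plain _<oid> suffix (without the .dat extension Andreas mentions) are all made up here for the sketch.

```c
/* Sketch of the fallback: build the expected <relname>_<oid> filename;
 * if it is missing (the table was renamed but the file was not), scan
 * the directory for any entry ending in _<oid> and rename it to match. */
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

static int
find_relfile(const char *dir, const char *relname, unsigned oid,
             char *path, size_t pathlen)
{
    struct stat    st;
    char           suffix[32];
    DIR           *d;
    struct dirent *de;

    snprintf(path, pathlen, "%s/%s_%u", dir, relname, oid);
    if (stat(path, &st) == 0)
        return 0;               /* expected name exists: fast path */

    /* fall back: look for any file carrying the _<oid> suffix */
    snprintf(suffix, sizeof(suffix), "_%u", oid);
    if ((d = opendir(dir)) == NULL)
        return -1;
    while ((de = readdir(d)) != NULL)
    {
        size_t len = strlen(de->d_name);
        size_t slen = strlen(suffix);

        if (len > slen && strcmp(de->d_name + len - slen, suffix) == 0)
        {
            char oldpath[1024];

            snprintf(oldpath, sizeof(oldpath), "%s/%s", dir, de->d_name);
            closedir(d);
            return rename(oldpath, path);   /* bring name up to date */
        }
    }
    closedir(d);
    return -1;                  /* no file for this oid at all */
}
```

Since the oid suffix alone identifies the relation, the relname prefix can be arbitrarily stale without breaking the lookup.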

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#66Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tatsuo Ishii (#58)
Re: Re: Big 7.1 open items

Hmm.. I thought Michael Robinson is the one who is against the idea of
rejecting "illegal" character sequences before they are put in the DB.
I like the idea but I haven't had time to do that (however, I'm not sure I
would like to do it for EUC-CN, since he dislikes the code I write).

Bruce, I would like to see the following in the TODO. I would also like
to hear from Thomas and Peter, or whoever is interested in
implementing the NATIONAL CHARACTER stuff, whether they are reasonable.

o Don't accept character sequences that are not valid in their charset
(signaling an ERROR seems appropriate, IMHO)

o Make PostgreSQL more multibyte aware (for example, TRIM function and
NAME data type)

o Regard n of CHAR(n)/VARCHAR(n) as the number of letters, rather than
the number of bytes

Added to TODO:

* Reject character sequences that are not valid in their charset
* Make functions more multi-byte aware, i.e. trim()
* Make n of CHAR(n)/VARCHAR(n) the number of letters, not bytes
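The letters-vs-bytes distinction behind the last TODO item can be illustrated with a small C sketch. utf8_length is a hypothetical helper, not PostgreSQL code, and it uses UTF-8 for concreteness (the thread mostly concerns EUC encodings, but the principle is the same): a multibyte character occupies several bytes yet should count as one letter.

```c
/* Count letters (code points) rather than bytes in a UTF-8 string.
 * Continuation bytes look like 10xxxxxx; only lead bytes are counted. */
#include <stddef.h>

static size_t
utf8_length(const char *s)
{
    size_t n = 0;

    for (; *s; s++)
    {
        if (((unsigned char) *s & 0xC0) != 0x80)
            n++;                /* lead byte of a character */
    }
    return n;
}
```

With n-counts-letters semantics, a CHAR(2) column could hold a two-character multibyte string even though it is six bytes long.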

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#67Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Michael Robinson (#27)
Re: Re: Big 7.1 open items

o Don't accept character sequences that are not valid in their
charset (signaling an ERROR seems appropriate, IMHO)
o Make PostgreSQL more multibyte aware (for example, TRIM function and
NAME data type)
o Regard n of CHAR(n)/VARCHAR(n) as the number of letters, rather than
the number of bytes

All good, and important features when we are done.

One issue: I can see (or imagine ;) how we can use the Postgres type
system to manage multiple character sets. But allowing arbitrary
character sets in, say, table names forces us to cope with allowing a
mix of character sets in a single column of a system table. afaik this
general capability is not mandated by SQL9x (the SQL_TEXT character set
is used for all system resources??). Would it be acceptable to have a
"default database character set" which is allowed to creep into the
pg_xxx tables? Even that seems to be a difficult thing to accomplish at
the moment (we'd need to get some of the text manipulation functions
from the catalogs, not from hardcoded references as we do now).

We should itemize all of these issues so we can keep track of what is
necessary, possible, and/or "easy".

- Thomas

#68Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Thomas Lockhart (#67)
AW: AW: Big 7.1 open items

In reality, very few people are going to be interested in restoring
a table in a way that breaks referential integrity and other
normal assumptions about what exists in the database.

This is not true. In my DBA history it would have saved me man-weeks
of work if an easy and efficient restore of one single table from backup
had been available in Informix and Oracle.
We always had to restore most of the whole system to another machine only
to get back at some table info that would then be manually re-added
to the production system.

I'm missing something, I guess. You would do a createdb, do a filesystem
copy of pg_log and one file into it, and then read data from the table
without having to restore the other tables in the database?

No, if you want to restore to a separate Postgres instance you need to
restore all the pg system tables as well.
What I meant is: create a new table in your production server and replace
the new 0-byte file with your backup file (renaming it accordingly).

Andreas

#69Ross J. Reedstrom
reedstrm@rice.edu
In reply to: Tom Lane (#51)
Re: Big 7.1 open items

On Thu, Jun 15, 2000 at 03:11:52AM -0400, Tom Lane wrote:

"Ross J. Reedstrom" <reedstrm@rice.edu> writes:

Any strong objections to the mixed relname_oid solution?

Yes!

You cannot make it work reliably unless the relname part is the original
relname and does not track ALTER TABLE RENAME. IMHO having an obsolete
relname in the filename is worse than not having the relname at all;
it's a recipe for confusion, it means you still need admin tools to tell
which end is really up, and what's worst is you might think you don't.

The plan here was to let VACUUM handle renaming the file, since it
will already have all the necessary locks. This shortens the window
of confusion. ALTER TABLE RENAME doesn't happen that often, really -
the relname is there just for human consumption, then.

Furthermore it requires an additional column in pg_class to keep track
of the original relname, which is a waste of space and effort.

I actually started down this path thinking about implementing SCHEMA,
since tables in the same DB but in different schema can have the same
relname, I figured I needed to change that. We'll need something in
pg_class to keep track of what schema a relation is in, instead.

It also creates a portability risk, or at least fails to remove one,
since you are critically dependent on the assumption that the OS
supports long filenames --- on a filesystem that truncates names to less
than about 45 characters you're in very deep trouble. An OID-only
approach still works on traditional 14-char-filename Unix filesystems
(it'd mostly even work on DOS 8+3, though I doubt we care about that).

Actually, no. Since I store the filename in a name attribute, I used this
nifty function somebody wrote, makeObjectName, to trim the relname part,
but leave the oid. (Yes, I know it's yours ;-)
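The trimming Ross describes might look roughly like this in C. This is a sketch under stated assumptions: make_physical_name and the hardcoded NAMEDATALEN of 32 are illustrative stand-ins, not the real makeObjectName(), which handles more cases.

```c
/* Trim the relname so that "<relname>_<oid>" still fits in a
 * fixed-width Name field, keeping the oid suffix intact. */
#include <stdio.h>
#include <string.h>

#define NAMEDATALEN 32          /* the historical default */

static void
make_physical_name(const char *relname, unsigned oid,
                   char *out /* at least NAMEDATALEN bytes */)
{
    char   oidbuf[16];
    size_t maxrel;

    snprintf(oidbuf, sizeof(oidbuf), "%u", oid);
    /* reserve room for '_', the oid digits, and the terminating NUL */
    maxrel = NAMEDATALEN - strlen(oidbuf) - 2;
    snprintf(out, NAMEDATALEN, "%.*s_%s", (int) maxrel, relname, oidbuf);
}
```

The point is that truncation always falls on the relname part, so a long table name can never shear off the oid digits that make the filename unique.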

Finally, one of the reasons I want to go to filenames based only on OID
is that that'll make life easier for mdblindwrt. Original relname + OID
doesn't help, in fact it makes life harder (more shmem space needed to
keep track of the filename for each buffer).

Can you explain in more detail how this helps? Not by letting the bufmgr
know that oid == filename, I hope. We need to improve the abstraction
of the smgr, not add another violation. Ah, sorry, mdblindwrt _is_
in the smgr.

Hmm, grovelling through that code, I see how it could be simpler if reloid
== filename. Heck, we even get to save shmem in the buffdesc.blind part,
since we only need the dbname in there, now.

Hmm, I see I missed the relpath_blind() in my patch - oops. (relpath()
is always called with RelationGetPhysicalRelationName(), and that's
where I was putting in the relphysname)

Hmm, what's all this with functions in catalog.c that are only called by
smgr/md.c? seems to me that anything having to do with physical storage
(like the path!) belongs in the smgr abstraction.

Can we *PLEASE JUST LET GO* of this bad idea? No relname in the
filename. Period.

Gee, so dogmatic. No one besides Bruce and Hiroshi discussed this _at
all_ when I first put up patches two months ago. O.K., I'll do the
oids-only version (and fix up relpath_blind).

Ross

--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005

#70Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Ross J. Reedstrom (#69)
Re: Big 7.1 open items

Can we *PLEASE JUST LET GO* of this bad idea? No relname in the
filename. Period.

Gee, so dogmatic. No one besides Bruce and Hiroshi discussed this _at
all_ when I first put up patches two months ago. O.K., I'll do the
oids-only version (and fix up relpath_blind).

Hold on. I don't think we want that work done yet. Seems even Tom is
thinking that if Vadim is going to re-do everything later anyway, we may
be better off with a relname/oid solution that doesn't require additional
administration apps.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#71Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Bruce Momjian (#70)
RE: Big 7.1 open items

-----Original Message-----
From: pgsql-hackers-owner@hub.org
[mailto:pgsql-hackers-owner@hub.org]On Behalf Of Bruce Momjian

Can we *PLEASE JUST LET GO* of this bad idea? No relname in the
filename. Period.

Gee, so dogmatic. No one besides Bruce and Hiroshi discussed this _at
all_ when I first put up patches two months ago. O.K., I'll do the
oids-only version (and fix up relpath_blind)

Hold on. I don't think we want that work done yet. Seems even Tom is
thinking that if Vadim is going to re-do everything later anyway, we may
be better off with a relname/oid solution that doesn't require additional
administration apps.

Hmm, why is the naming rule the first concern?

I've never emphasized a naming rule except that it should be unique.
My main point has been to reduce the necessity of a naming rule as
much as possible. IIRC, by keeping the stored place in pg_class, Ross's
trial patch leaves only 2 places where a naming rule is required.
So wouldn't we be free from the naming rule (it would not be so difficult
to change the naming rule if the rule is found to be bad)?

I've also mentioned many times that neither relname nor oid is sufficient
for uniqueness. In addition, neither relname nor oid is
necessary for uniqueness.
IMHO, it's bad to rely on an item which is neither necessary nor
sufficient.
I proposed relname+unique_id naming once. The unique_id is
independent of the oid. The relname is only for the DBA's convenience,
so we don't have to change it due to RENAME.
The DB's consistency is much more important than the DBA's satisfaction.

Comments?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#72Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Hiroshi Inoue (#71)
Re: Big 7.1 open items

I've also mentioned many times that neither relname nor oid is sufficient
for uniqueness. In addition, neither relname nor oid is
necessary for uniqueness.
IMHO, it's bad to rely on an item which is neither necessary nor
sufficient.
I proposed relname+unique_id naming once. The unique_id is
independent of the oid. The relname is only for the DBA's convenience,
so we don't have to change it due to RENAME.
The DB's consistency is much more important than the DBA's satisfaction.

Comments ?

I am happy not to rename the file on 'RENAME', but it seems no one likes
that.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#73Noname
JanWieck@t-online.de
In reply to: Don Baccus (#63)
Re: AW: Big 7.1 open items

Don Baccus wrote:

At 10:04 AM 6/15/00 +0200, Zeugswetter Andreas SB wrote:

This is not true. In my DBA history it would have saved me man-weeks
of work if an easy and efficient restore of one single table from backup
had been available in Informix and Oracle.
We always had to restore most of the whole system to another machine only
to get back at some table info that would then be manually re-added
to the production system.

I'm missing something, I guess. You would do a createdb, do a filesystem
copy of pg_log and one file into it, and then read data from the table
without having to restore the other tables in the database?

I'm just curious - when was the last time you restored a Postgres
database in this piecemeal manner, and how often do you do it?

More curious to me is that people seem to use physical file
based backup at all. Do they shut down the postmaster during
backup, or do they live with the fact that maybe not every
backup is a valid one?

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#74Ross J. Reedstrom
reedstrm@rice.edu
In reply to: Bruce Momjian (#72)
Re: Big 7.1 open items

On Thu, Jun 15, 2000 at 05:48:59PM -0400, Bruce Momjian wrote:

I've also mentioned many times that neither relname nor oid is sufficient
for uniqueness. In addition, neither relname nor oid is
necessary for uniqueness.
IMHO, it's bad to rely on an item which is neither necessary nor
sufficient.
I proposed relname+unique_id naming once. The unique_id is
independent of the oid. The relname is only for the DBA's convenience,
so we don't have to change it due to RENAME.
The DB's consistency is much more important than the DBA's satisfaction.

Comments ?

I am happy not to rename the file on 'RENAME', but seems no one likes
that.

Good, 'cause that's how I've implemented it so far. Actually, all
I've done is port my previous patch to current, with one little
change: I added a macro RelationGetRealRelationName which does what
RelationGetPhysicalRelationName used to do, i.e. return the relname with
no temp-table funny business, and used that for the relcache macros. It
passes all the serial regression tests; I haven't run the parallel tests
yet. ALTER TABLE RENAME rolls back nicely. I'll need to learn some more
about xacts to get DROP TABLE rolling back.

I'll drop it on PATCHES right now, for comment.

Ross
--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005

#75Ross J. Reedstrom
reedstrm@rice.edu
In reply to: Bruce Momjian (#72)
1 attachment(s)
filename patch (was Re: [HACKERS] Big 7.1 open items)

Here's the patch I promised on HACKERS. Comments anyone?

Ross
--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005

Attachments:

oid_names.diff (text/plain; charset=us-ascii)
Index: backend/catalog/heap.c
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/backend/catalog/heap.c,v
retrieving revision 1.131
diff -u -r1.131 heap.c
--- backend/catalog/heap.c	2000/06/15 03:32:01	1.131
+++ backend/catalog/heap.c	2000/06/15 22:52:22
@@ -56,6 +56,7 @@
 #include "parser/parse_relation.h"
 #include "parser/parse_target.h"
 #include "parser/parse_type.h"
+#include "parser/analyze.h" /* for makeObjectName */
 #include "rewrite/rewriteRemove.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -187,6 +188,8 @@
 	int			i;
 	Oid			relid;
 	Relation	rel;
+	char		*relphysname;
+	char		*tmpname;
 	int			len;
 	bool		nailme = false;
 	int			natts = tupDesc->natts;
@@ -242,6 +245,31 @@
 		relid = RelOid_pg_type;
 		nailme = true;
 	}
+	else if (relname && !strcmp(DatabaseRelationName, relname))
+	{
+		relid = RelOid_pg_database;
+		nailme = true;
+	}
+	else if (relname && !strcmp(GroupRelationName, relname))
+	{
+		relid = RelOid_pg_group;
+		nailme = true;
+	}
+	else if (relname && !strcmp(LogRelationName, relname))
+	{
+		relid = RelOid_pg_log;
+		nailme = true;
+	}
+	else if (relname && !strcmp(ShadowRelationName, relname))
+	{
+		relid = RelOid_pg_shadow;
+		nailme = true;
+	}
+	else if (relname && !strcmp(VariableRelationName, relname))
+	{
+		relid = RelOid_pg_variable;
+		nailme = true;
+	}
 	else
 		relid = newoid();
 
@@ -259,6 +287,14 @@
 		snprintf(relname, NAMEDATALEN, "pg_temp.%d.%u", MyProcPid, uniqueId++);
 	}
 
+	/* now that we have the oid and name, we can set the physical filename
+	 * Use makeObjectName() since we need to store this in a fix length
+	 * (NAMEDATALEN) Name field and don't want the OID part truncated
+	 */
+	tmpname = palloc(NAMEDATALEN);
+	snprintf(tmpname, NAMEDATALEN, "%d", relid);
+	relphysname = makeObjectName(relname,NULL,tmpname);
+
 	/* ----------------
 	 *	allocate a new relation descriptor.
 	 * ----------------
@@ -293,7 +329,8 @@
 	 * ----------------
 	 */
 	MemSet((char *) rel->rd_rel, 0, sizeof *rel->rd_rel);
-	strcpy(RelationGetPhysicalRelationName(rel), relname);
+	strcpy(RelationGetRelationName(rel), relname);
+	strcpy(RelationGetPhysicalRelationName(rel), relphysname);
 	rel->rd_rel->relkind = RELKIND_UNCATALOGED;
 	rel->rd_rel->relnatts = natts;
 	if (tupDesc->constr)
Index: backend/commands/rename.c
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/backend/commands/rename.c,v
retrieving revision 1.45
diff -u -r1.45 rename.c
--- backend/commands/rename.c	2000/05/25 21:30:20	1.45
+++ backend/commands/rename.c	2000/06/15 22:52:22
@@ -312,6 +312,10 @@
 	 * XXX smgr.c ought to provide an interface for this; doing it directly
 	 * is bletcherous.
 	 */
+#ifdef NOT_USED
+	/* took this out to try OID only filenames, left it out while
+	trying relname_oid names  RJR */
+
 	strcpy(oldpath, relpath(oldrelname));
 	strcpy(newpath, relpath(newrelname));
 	if (rename(oldpath, newpath) < 0)
@@ -333,4 +337,5 @@
 				 toldpath, tnewpath);
 		}
 	}
+#endif /* oidnames */
 }
Index: backend/parser/analyze.c
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/backend/parser/analyze.c,v
retrieving revision 1.147
diff -u -r1.147 analyze.c
--- backend/parser/analyze.c	2000/06/12 19:40:40	1.147
+++ backend/parser/analyze.c	2000/06/15 22:52:22
@@ -498,7 +498,7 @@
  *	from the truncated characters.	Currently it seems best to keep it simple,
  *	so that the generated names are easily predictable by a person.
  */
-static char *
+char *
 makeObjectName(char *name1, char *name2, char *typename)
 {
 	char	   *name;
Index: backend/postmaster/postmaster.c
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/backend/postmaster/postmaster.c,v
retrieving revision 1.148
diff -u -r1.148 postmaster.c
--- backend/postmaster/postmaster.c	2000/06/14 18:17:38	1.148
+++ backend/postmaster/postmaster.c	2000/06/15 22:52:23
@@ -47,6 +47,7 @@
 #include <fcntl.h>
 #include <time.h>
 #include <sys/param.h>
+#include <catalog/catname.h>
 
  /* moved here to prevent double define */
 #ifdef HAVE_NETDB_H
@@ -316,8 +317,9 @@
 		char		path[MAXPGPATH];
 		FILE	   *fp;
 
-		snprintf(path, sizeof(path), "%s%cbase%ctemplate1%cpg_class",
-				 DataDir, SEP_CHAR, SEP_CHAR, SEP_CHAR);
+		snprintf(path, sizeof(path), "%s%cbase%ctemplate1%c%s",
+				DataDir, SEP_CHAR, SEP_CHAR, SEP_CHAR,RelationPhysicalRelationName);
+
 		fp = AllocateFile(path, PG_BINARY_R);
 		if (fp == NULL)
 		{
Index: backend/storage/lmgr/lmgr.c
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/backend/storage/lmgr/lmgr.c,v
retrieving revision 1.41
diff -u -r1.41 lmgr.c
--- backend/storage/lmgr/lmgr.c	2000/06/08 22:37:24	1.41
+++ backend/storage/lmgr/lmgr.c	2000/06/15 22:52:23
@@ -112,7 +112,7 @@
 	Assert(RelationIsValid(relation));
 	Assert(OidIsValid(RelationGetRelid(relation)));
 
-	relname = (char *) RelationGetPhysicalRelationName(relation);
+	relname = (char *) RelationGetRelationName(relation);
 
 	relation->rd_lockInfo.lockRelId.relId = RelationGetRelid(relation);
 
Index: backend/utils/cache/relcache.c
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/backend/utils/cache/relcache.c,v
retrieving revision 1.99
diff -u -r1.99 relcache.c
--- backend/utils/cache/relcache.c	2000/06/02 15:57:30	1.99
+++ backend/utils/cache/relcache.c	2000/06/15 22:52:24
@@ -60,6 +60,7 @@
 #include "utils/fmgroids.h"
 #include "utils/relcache.h"
 #include "utils/temprel.h"
+#include "parser/analyze.h" /* for makeObjectName */
 
 
 /* ----------------
@@ -128,7 +129,7 @@
 do { \
 	RelIdCacheEnt *idhentry; RelNameCacheEnt *namehentry; \
 	char *relname; Oid reloid; bool found; \
-	relname = RelationGetPhysicalRelationName(RELATION); \
+	relname = RelationGetRealRelationName(RELATION); \
 	namehentry = (RelNameCacheEnt*)hash_search(RelationNameCache, \
 											   relname, \
 											   HASH_ENTER, \
@@ -181,7 +182,7 @@
 do { \
 	RelNameCacheEnt *namehentry; RelIdCacheEnt *idhentry; \
 	char *relname; Oid reloid; bool found; \
-	relname = RelationGetPhysicalRelationName(RELATION); \
+	relname = RelationGetRealRelationName(RELATION); \
 	namehentry = (RelNameCacheEnt*)hash_search(RelationNameCache, \
 											   relname, \
 											   HASH_REMOVE, \
@@ -1055,6 +1056,7 @@
 	Relation	relation;
 	Size		len;
 	u_int		i;
+	char		*tmpname;
 
 	/* ----------------
 	 *	allocate new relation desc
@@ -1083,7 +1085,7 @@
 	relation->rd_rel = (Form_pg_class)
 		palloc((Size) (sizeof(*relation->rd_rel)));
 	MemSet(relation->rd_rel, 0, sizeof(FormData_pg_class));
-	strcpy(RelationGetPhysicalRelationName(relation), relationName);
+	strcpy(RelationGetRealRelationName(relation), relationName);
 
 	/* ----------------
 	   initialize attribute tuple form
@@ -1131,6 +1133,14 @@
 	 * ----------------
 	 */
 	RelationGetRelid(relation) = relation->rd_att->attrs[0]->attrelid;
+
+	/* ----------------
+	 *	initialize relation physical name, now that we have the oid
+	 * ----------------
+	 */
+	tmpname = palloc(NAMEDATALEN);
+	snprintf(tmpname, NAMEDATALEN, "%u", RelationGetRelid(relation));
+	strcpy (RelationGetPhysicalRelationName(relation), makeObjectName(relationName,NULL,tmpname));
 
 	/* ----------------
 	 *	initialize the relation lock manager information
Index: backend/utils/init/globals.c
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/backend/utils/init/globals.c,v
retrieving revision 1.45
diff -u -r1.45 globals.c
--- backend/utils/init/globals.c	2000/05/31 00:28:32	1.45
+++ backend/utils/init/globals.c	2000/06/15 22:52:24
@@ -113,6 +113,8 @@
  *		is done on it in catalog.c!
  *
  *		XXX this is a serious hack which should be fixed -cim 1/26/90
+ *		XXX Really bogus addition of fixed OIDs, to test
+ *		relname -> filename linkage  (RJR 08Feb2000)
  * ----------------
  */
 char	   *SharedSystemRelationNames[] = {
@@ -123,5 +125,10 @@
 	LogRelationName,
 	ShadowRelationName,
 	VariableRelationName,
+	DatabasePhysicalRelationName,
+	GroupPhysicalRelationName,
+	LogPhysicalRelationName,
+	ShadowPhysicalRelationName,
+	VariablePhysicalRelationName,
 	0
 };
Index: backend/utils/misc/database.c
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/backend/utils/misc/database.c,v
retrieving revision 1.38
diff -u -r1.38 database.c
--- backend/utils/misc/database.c	2000/06/02 15:57:34	1.38
+++ backend/utils/misc/database.c	2000/06/15 22:52:24
@@ -143,8 +143,8 @@
 	char	   *dbfname;
 	Form_pg_database tup_db;
 
-	dbfname = (char *) palloc(strlen(DataDir) + strlen(DatabaseRelationName) + 2);
-	sprintf(dbfname, "%s%c%s", DataDir, SEP_CHAR, DatabaseRelationName);
+	dbfname = (char *) palloc(strlen(DataDir) + strlen(DatabasePhysicalRelationName) + 2);
+	sprintf(dbfname, "%s%c%s", DataDir, SEP_CHAR, DatabasePhysicalRelationName);
 
 	if ((dbfd = open(dbfname, O_RDONLY | PG_BINARY, 0)) < 0)
 		elog(FATAL, "cannot open %s: %s", dbfname, strerror(errno));
Index: include/catalog/catname.h
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/include/catalog/catname.h,v
retrieving revision 1.12
diff -u -r1.12 catname.h
--- include/catalog/catname.h	2000/01/26 05:57:56	1.12
+++ include/catalog/catname.h	2000/06/15 22:52:25
@@ -45,6 +45,13 @@
 #define  RelCheckRelationName "pg_relcheck"
 #define  TriggerRelationName "pg_trigger"
 
+#define	DatabasePhysicalRelationName 	"pg_database_1262"
+#define	GroupPhysicalRelationName   	"pg_group_1261"
+#define	LogPhysicalRelationName 	 	"pg_log_1269"
+#define	ShadowPhysicalRelationName 	 	"pg_shadow_1260"
+#define	VariablePhysicalRelationName 	"pg_variable_1264"
+#define	RelationPhysicalRelationName	"pg_class_1259"
+
 extern char *SharedSystemRelationNames[];
 
 #endif	 /* CATNAME_H */
Index: include/catalog/pg_attribute.h
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/include/catalog/pg_attribute.h,v
retrieving revision 1.59
diff -u -r1.59 pg_attribute.h
--- include/catalog/pg_attribute.h	2000/06/12 03:40:52	1.59
+++ include/catalog/pg_attribute.h	2000/06/15 22:52:25
@@ -412,46 +412,48 @@
  */
 #define Schema_pg_class \
 { 1259, {"relname"},	   19, 0, NAMEDATALEN,	1, 0, -1, -1, '\0', 'p', '\0', 'i', '\0', '\0' }, \
-{ 1259, {"reltype"},	   26, 0,	4,	2, 0, -1, -1, '\001', 'p', '\0', 'i', '\0', '\0' }, \
-{ 1259, {"relowner"},	   23, 0,	4,	3, 0, -1, -1, '\001', 'p', '\0', 'i', '\0', '\0' }, \
-{ 1259, {"relam"},		   26, 0,	4,	4, 0, -1, -1, '\001', 'p', '\0', 'i', '\0', '\0' }, \
-{ 1259, {"relpages"},	   23, 0,	4,	5, 0, -1, -1, '\001', 'p', '\0', 'i', '\0', '\0' }, \
-{ 1259, {"reltuples"},	   23, 0,	4,	6, 0, -1, -1, '\001', 'p', '\0', 'i', '\0', '\0' }, \
-{ 1259, {"rellongrelid"},  26, 0,	4,	7, 0, -1, -1, '\001', 'p', '\0', 'i', '\0', '\0' }, \
-{ 1259, {"relhasindex"},   16, 0,	1,	8, 0, -1, -1, '\001', 'p', '\0', 'c', '\0', '\0' }, \
-{ 1259, {"relisshared"},   16, 0,	1,	9, 0, -1, -1, '\001', 'p', '\0', 'c', '\0', '\0' }, \
-{ 1259, {"relkind"},	   18, 0,	1, 10, 0, -1, -1, '\001', 'p', '\0', 'c', '\0', '\0' }, \
-{ 1259, {"relnatts"},	   21, 0,	2, 11, 0, -1, -1, '\001', 'p', '\0', 's', '\0', '\0' }, \
-{ 1259, {"relchecks"},	   21, 0,	2, 12, 0, -1, -1, '\001', 'p', '\0', 's', '\0', '\0' }, \
-{ 1259, {"reltriggers"},   21, 0,	2, 13, 0, -1, -1, '\001', 'p', '\0', 's', '\0', '\0' }, \
-{ 1259, {"relukeys"},	   21, 0,	2, 14, 0, -1, -1, '\001', 'p', '\0', 's', '\0', '\0' }, \
-{ 1259, {"relfkeys"},	   21, 0,	2, 15, 0, -1, -1, '\001', 'p', '\0', 's', '\0', '\0' }, \
-{ 1259, {"relrefs"},	   21, 0,	2, 16, 0, -1, -1, '\001', 'p', '\0', 's', '\0', '\0' }, \
-{ 1259, {"relhaspkey"},    16, 0,	1, 17, 0, -1, -1, '\001', 'p', '\0', 'c', '\0', '\0' }, \
-{ 1259, {"relhasrules"},   16, 0,	1, 18, 0, -1, -1, '\001', 'p', '\0', 'c', '\0', '\0' }, \
-{ 1259, {"relhassubclass"},16, 0,	1, 19, 0, -1, -1, '\001', 'p', '\0', 'c', '\0', '\0' }, \
-{ 1259, {"relacl"},		 1034, 0,  -1, 20, 0, -1, -1,	'\0', 'p', '\0', 'i', '\0', '\0' }
+{ 1259, {"relphysname"},   19, 0, NAMEDATALEN,	2, 0, -1, -1, '\0', 'p', '\0', 'i', '\0', '\0' }, \
+{ 1259, {"reltype"},	   26, 0,	4,	3, 0, -1, -1, '\001', 'p', '\0', 'i', '\0', '\0' }, \
+{ 1259, {"relowner"},	   23, 0,	4,	4, 0, -1, -1, '\001', 'p', '\0', 'i', '\0', '\0' }, \
+{ 1259, {"relam"},		   26, 0,	4,	5, 0, -1, -1, '\001', 'p', '\0', 'i', '\0', '\0' }, \
+{ 1259, {"relpages"},	   23, 0,	4,	6, 0, -1, -1, '\001', 'p', '\0', 'i', '\0', '\0' }, \
+{ 1259, {"reltuples"},	   23, 0,	4,	7, 0, -1, -1, '\001', 'p', '\0', 'i', '\0', '\0' }, \
+{ 1259, {"rellongrelid"},  26, 0,	4,	8, 0, -1, -1, '\001', 'p', '\0', 'i', '\0', '\0' }, \
+{ 1259, {"relhasindex"},   16, 0,	1,	9, 0, -1, -1, '\001', 'p', '\0', 'c', '\0', '\0' }, \
+{ 1259, {"relisshared"},   16, 0,	1, 10, 0, -1, -1, '\001', 'p', '\0', 'c', '\0', '\0' }, \
+{ 1259, {"relkind"},	   18, 0,	1, 11, 0, -1, -1, '\001', 'p', '\0', 'c', '\0', '\0' }, \
+{ 1259, {"relnatts"},	   21, 0,	2, 12, 0, -1, -1, '\001', 'p', '\0', 's', '\0', '\0' }, \
+{ 1259, {"relchecks"},	   21, 0,	2, 13, 0, -1, -1, '\001', 'p', '\0', 's', '\0', '\0' }, \
+{ 1259, {"reltriggers"},   21, 0,	2, 14, 0, -1, -1, '\001', 'p', '\0', 's', '\0', '\0' }, \
+{ 1259, {"relukeys"},	   21, 0,	2, 15, 0, -1, -1, '\001', 'p', '\0', 's', '\0', '\0' }, \
+{ 1259, {"relfkeys"},	   21, 0,	2, 16, 0, -1, -1, '\001', 'p', '\0', 's', '\0', '\0' }, \
+{ 1259, {"relrefs"},	   21, 0,	2, 17, 0, -1, -1, '\001', 'p', '\0', 's', '\0', '\0' }, \
+{ 1259, {"relhaspkey"},    16, 0,	1, 18, 0, -1, -1, '\001', 'p', '\0', 'c', '\0', '\0' }, \
+{ 1259, {"relhasrules"},   16, 0,	1, 19, 0, -1, -1, '\001', 'p', '\0', 'c', '\0', '\0' }, \
+{ 1259, {"relhassubclass"},16, 0,	1, 20, 0, -1, -1, '\001', 'p', '\0', 'c', '\0', '\0' }, \
+{ 1259, {"relacl"},		 1034, 0,  -1, 21, 0, -1, -1,	'\0', 'p', '\0', 'i', '\0', '\0' }
 
 DATA(insert OID = 0 ( 1259 relname			19 0 NAMEDATALEN   1 0 -1 -1 f p f i f f));
-DATA(insert OID = 0 ( 1259 reltype			26 0  4   2 0 -1 -1 t p f i f f));
-DATA(insert OID = 0 ( 1259 relowner			23 0  4   3 0 -1 -1 t p f i f f));
-DATA(insert OID = 0 ( 1259 relam			26 0  4   4 0 -1 -1 t p f i f f));
-DATA(insert OID = 0 ( 1259 relpages			23 0  4   5 0 -1 -1 t p f i f f));
-DATA(insert OID = 0 ( 1259 reltuples		23 0  4   6 0 -1 -1 t p f i f f));
-DATA(insert OID = 0 ( 1259 rellongrelid		26 0  4   7 0 -1 -1 t p f i f f));
-DATA(insert OID = 0 ( 1259 relhasindex		16 0  1   8 0 -1 -1 t p f c f f));
-DATA(insert OID = 0 ( 1259 relisshared		16 0  1   9 0 -1 -1 t p f c f f));
-DATA(insert OID = 0 ( 1259 relkind			18 0  1  10 0 -1 -1 t p f c f f));
-DATA(insert OID = 0 ( 1259 relnatts			21 0  2  11 0 -1 -1 t p f s f f));
-DATA(insert OID = 0 ( 1259 relchecks		21 0  2  12 0 -1 -1 t p f s f f));
-DATA(insert OID = 0 ( 1259 reltriggers		21 0  2  13 0 -1 -1 t p f s f f));
-DATA(insert OID = 0 ( 1259 relukeys			21 0  2  14 0 -1 -1 t p f s f f));
-DATA(insert OID = 0 ( 1259 relfkeys			21 0  2  15 0 -1 -1 t p f s f f));
-DATA(insert OID = 0 ( 1259 relrefs			21 0  2  16 0 -1 -1 t p f s f f));
-DATA(insert OID = 0 ( 1259 relhaspkey		16 0  1  17 0 -1 -1 t p f c f f));
-DATA(insert OID = 0 ( 1259 relhasrules		16 0  1  18 0 -1 -1 t p f c f f));
-DATA(insert OID = 0 ( 1259 relhassubclass	16 0  1   19 0 -1 -1 t p f c f f));
-DATA(insert OID = 0 ( 1259 relacl		  1034 0 -1  20 0 -1 -1 f p f i f f));
+DATA(insert OID = 0 ( 1259 relphysname			19 0 NAMEDATALEN   2 0 -1 -1 f p f i f f));
+DATA(insert OID = 0 ( 1259 reltype			26 0  4   3 0 -1 -1 t p f i f f));
+DATA(insert OID = 0 ( 1259 relowner			23 0  4   4 0 -1 -1 t p f i f f));
+DATA(insert OID = 0 ( 1259 relam			26 0  4   5 0 -1 -1 t p f i f f));
+DATA(insert OID = 0 ( 1259 relpages			23 0  4   6 0 -1 -1 t p f i f f));
+DATA(insert OID = 0 ( 1259 reltuples		23 0  4   7 0 -1 -1 t p f i f f));
+DATA(insert OID = 0 ( 1259 rellongrelid		26 0  4   8 0 -1 -1 t p f i f f));
+DATA(insert OID = 0 ( 1259 relhasindex		16 0  1   9 0 -1 -1 t p f c f f));
+DATA(insert OID = 0 ( 1259 relisshared		16 0  1  10 0 -1 -1 t p f c f f));
+DATA(insert OID = 0 ( 1259 relkind			18 0  1  11 0 -1 -1 t p f c f f));
+DATA(insert OID = 0 ( 1259 relnatts			21 0  2  12 0 -1 -1 t p f s f f));
+DATA(insert OID = 0 ( 1259 relchecks		21 0  2  13 0 -1 -1 t p f s f f));
+DATA(insert OID = 0 ( 1259 reltriggers		21 0  2  14 0 -1 -1 t p f s f f));
+DATA(insert OID = 0 ( 1259 relukeys			21 0  2  15 0 -1 -1 t p f s f f));
+DATA(insert OID = 0 ( 1259 relfkeys			21 0  2  16 0 -1 -1 t p f s f f));
+DATA(insert OID = 0 ( 1259 relrefs			21 0  2  17 0 -1 -1 t p f s f f));
+DATA(insert OID = 0 ( 1259 relhaspkey		16 0  1  18 0 -1 -1 t p f c f f));
+DATA(insert OID = 0 ( 1259 relhasrules		16 0  1  19 0 -1 -1 t p f c f f));
+DATA(insert OID = 0 ( 1259 relhassubclass	16 0  1   20 0 -1 -1 t p f c f f));
+DATA(insert OID = 0 ( 1259 relacl		  1034 0 -1  21 0 -1 -1 f p f i f f));
 DATA(insert OID = 0 ( 1259 ctid				27 0  6  -1 0 -1 -1 f p f i f f));
 DATA(insert OID = 0 ( 1259 oid				26 0  4  -2 0 -1 -1 t p f i f f));
 DATA(insert OID = 0 ( 1259 xmin				28 0  4  -3 0 -1 -1 t p f i f f));
Index: include/catalog/pg_class.h
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/include/catalog/pg_class.h,v
retrieving revision 1.37
diff -u -r1.37 pg_class.h
--- include/catalog/pg_class.h	2000/06/12 03:40:53	1.37
+++ include/catalog/pg_class.h	2000/06/15 22:52:25
@@ -54,6 +54,7 @@
 CATALOG(pg_class) BOOTSTRAP
 {
 	NameData	relname;
+	NameData	relphysname;
 	Oid			reltype;
 	int4		relowner;
 	Oid			relam;
@@ -103,60 +104,62 @@
  *		relacl field.
  * ----------------
  */
-#define Natts_pg_class_fixed			19
-#define Natts_pg_class					20
+#define Natts_pg_class_fixed			20
+#define Natts_pg_class					21
 #define Anum_pg_class_relname			1
-#define Anum_pg_class_reltype			2
-#define Anum_pg_class_relowner			3
-#define Anum_pg_class_relam				4
-#define Anum_pg_class_relpages			5
-#define Anum_pg_class_reltuples			6
-#define Anum_pg_class_rellongrelid		7
-#define Anum_pg_class_relhasindex		8
-#define Anum_pg_class_relisshared		9
-#define Anum_pg_class_relkind			10
-#define Anum_pg_class_relnatts			11
-#define Anum_pg_class_relchecks			12
-#define Anum_pg_class_reltriggers		13
-#define Anum_pg_class_relukeys			14
-#define Anum_pg_class_relfkeys			15
-#define Anum_pg_class_relrefs			16
-#define Anum_pg_class_relhaspkey		17
-#define Anum_pg_class_relhasrules		18
-#define Anum_pg_class_relhassubclass		19
-#define Anum_pg_class_relacl			20
+#define Anum_pg_class_relphysname		2
+#define Anum_pg_class_reltype			3
+#define Anum_pg_class_relowner			4
+#define Anum_pg_class_relam				5
+#define Anum_pg_class_relpages			6
+#define Anum_pg_class_reltuples			7
+#define Anum_pg_class_rellongrelid		8
+#define Anum_pg_class_relhasindex		9
+#define Anum_pg_class_relisshared		10
+#define Anum_pg_class_relkind			11
+#define Anum_pg_class_relnatts			12
+#define Anum_pg_class_relchecks			13
+#define Anum_pg_class_reltriggers		14
+#define Anum_pg_class_relukeys			15
+#define Anum_pg_class_relfkeys			16
+#define Anum_pg_class_relrefs			17
+#define Anum_pg_class_relhaspkey		18
+#define Anum_pg_class_relhasrules		19
+#define Anum_pg_class_relhassubclass	20
+#define Anum_pg_class_relacl			21
 
 /* ----------------
  *		initial contents of pg_class
  * ----------------
  */
 
-DATA(insert OID = 1247 (  pg_type 71		  PGUID 0 0 0 0 f f r 16 0 0 0 0 0 f f f _null_ ));
+DATA(insert OID = 1247 (  pg_type "pg_type_1247" 71		  PGUID 0 0 0 0 f f r 16 0 0 0 0 0 f f f _null_ ));
 DESCR("");
-DATA(insert OID = 1249 (  pg_attribute 75	  PGUID 0 0 0 0 f f r 15 0 0 0 0 0 f f f _null_ ));
+DATA(insert OID = 1249 (  pg_attribute "pg_attribute_1249" 75	  PGUID 0 0 0 0 f f r 15 0 0 0 0 0 f f f _null_ ));
 DESCR("");
-DATA(insert OID = 1255 (  pg_proc 81		  PGUID 0 0 0 0 f f r 17 0 0 0 0 0 f f f _null_ ));
+DATA(insert OID = 1255 (  pg_proc "pg_proc_1255" 81		  PGUID 0 0 0 0 f f r 17 0 0 0 0 0 f f f _null_ ));
 DESCR("");
-DATA(insert OID = 1259 (  pg_class 83		  PGUID 0 0 0 0 f f r 20 0 0 0 0 0 f f f _null_ ));
+DATA(insert OID = 1259 (  pg_class "pg_class_1259" 83		  PGUID 0 0 0 0 f f r 21 0 0 0 0 0 f f f _null_ ));
 DESCR("");
-DATA(insert OID = 1260 (  pg_shadow 86		  PGUID 0 0 0 0 f t r 8  0 0 0 0 0 f f f _null_ ));
+DATA(insert OID = 1260 (  pg_shadow "pg_shadow_1260" 86		  PGUID 0 0 0 0 f t r 8  0 0 0 0 0 f f f _null_ ));
 DESCR("");
-DATA(insert OID = 1261 (  pg_group 87		  PGUID 0 0 0 0 f t r 3  0 0 0 0 0 f f f _null_ ));
+DATA(insert OID = 1261 (  pg_group "pg_group_1261" 87		  PGUID 0 0 0 0 f t r 3  0 0 0 0 0 f f f _null_ ));
 DESCR("");
-DATA(insert OID = 1262 (  pg_database 88	  PGUID 0 0 0 0 f t r 4  0 0 0 0 0 f f f _null_ ));
+DATA(insert OID = 1262 (  pg_database "pg_database_1262" 88	  PGUID 0 0 0 0 f t r 4  0 0 0 0 0 f f f _null_ ));
 DESCR("");
-DATA(insert OID = 1264 (  pg_variable 90	  PGUID 0 0 0 0 f t s 1  0 0 0 0 0 f f f _null_ ));
+DATA(insert OID = 1264 (  pg_variable "pg_variable_1264" 90	  PGUID 0 0 0 0 f t s 1  0 0 0 0 0 f f f _null_ ));
 DESCR("");
-DATA(insert OID = 1269 (  pg_log  99		  PGUID 0 0 0 0 f t s 1  0 0 0 0 0 f f f _null_ ));
+DATA(insert OID = 1269 (  pg_log "pg_log_1269"  99		  PGUID 0 0 0 0 f t s 1  0 0 0 0 0 f f f _null_ ));
 DESCR("");
-DATA(insert OID = 376  (  pg_xactlock  0	  PGUID 0 0 0 0 f t s 1  0 0 0 0 0 f f f _null_ ));
+DATA(insert OID = 376  (  pg_xactlock "pg_xactlock_376"  0	  PGUID 0 0 0 0 f t s 1  0 0 0 0 0 f f f _null_ ));
 DESCR("");
-DATA(insert OID = 1215 (  pg_attrdef 109	  PGUID 0 0 0 0 t t r 4  0 0 0 0 0 f f f _null_ ));
+DATA(insert OID = 1215 (  pg_attrdef "pg_attrdef_1215" 109	  PGUID 0 0 0 0 t t r 4  0 0 0 0 0 f f f _null_ ));
 DESCR("");
-DATA(insert OID = 1216 (  pg_relcheck 110	  PGUID 0 0 0 0 t t r 4  0 0 0 0 0 f f f _null_ ));
+DATA(insert OID = 1216 (  pg_relcheck "pg_relcheck_1216" 110  PGUID 0 0 0 0 t t r 4  0 0 0 0 0 f f f _null_ ));
 DESCR("");
-DATA(insert OID = 1219 (  pg_trigger 111	  PGUID 0 0 0 0 t t r 13  0 0 0 0 0 f f f _null_ ));
+DATA(insert OID = 1219 (  pg_trigger "pg_trigger_1219" 111	  PGUID 0 0 0 0 t t r 13  0 0 0 0 0 f f f _null_ ));
 DESCR("");
+
 
 #define RelOid_pg_type			1247
 #define RelOid_pg_attribute		1249
Index: include/parser/analyze.h
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/include/parser/analyze.h,v
retrieving revision 1.10
diff -u -r1.10 analyze.h
--- include/parser/analyze.h	2000/01/26 05:58:26	1.10
+++ include/parser/analyze.h	2000/06/15 22:52:25
@@ -20,4 +20,8 @@
 extern void create_select_list(Node *ptr, List **select_list, bool *unionall_present);
 extern Node *A_Expr_to_Expr(Node *ptr, bool *intersect_present);
 
+/* Routine to make names that are less than NAMEDATALEN long */
+
+extern char *makeObjectName(char *name1, char *name2, char *typename);
+
 #endif	 /* ANALYZE_H */
Index: include/utils/rel.h
===================================================================
RCS file: /home/projects/pgsql/cvsroot/pgsql/src/include/utils/rel.h,v
retrieving revision 1.36
diff -u -r1.36 rel.h
--- include/utils/rel.h	2000/04/12 17:16:55	1.36
+++ include/utils/rel.h	2000/06/15 22:52:25
@@ -184,22 +184,29 @@
  */
 #define RelationGetRelationName(relation) \
 (\
-	(strncmp(RelationGetPhysicalRelationName(relation), \
+	(strncmp((NameStr((relation)->rd_rel->relname)), \
 	 "pg_temp.", strlen("pg_temp.")) != 0) \
 	? \
-		RelationGetPhysicalRelationName(relation) \
+		(NameStr((relation)->rd_rel->relname)) \
 	: \
 		get_temp_rel_by_physicalname( \
-			RelationGetPhysicalRelationName(relation)) \
+			(NameStr((relation)->rd_rel->relname))) \
 )
 
+/*
+ * RelationGetRealRelationName
+ *
+ *	  Returns a Relation Name
+ */
+#define RelationGetRealRelationName(relation) (NameStr((relation)->rd_rel->relname))
+
 
 /*
  * RelationGetPhysicalRelationName
  *
  *	  Returns a Relation Name
  */
-#define RelationGetPhysicalRelationName(relation) (NameStr((relation)->rd_rel->relname))
+#define RelationGetPhysicalRelationName(relation) (NameStr((relation)->rd_rel->relphysname))
 
 /*
  * RelationGetNumberOfAttributes
#76Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Noname (#73)
Re: AW: Big 7.1 open items

I'm just curious - when was the last time you restored a Postgres
database in this piecemeal manner, and how often do you do it?

More curious to me is that people seem to use physical file-based
backup at all. Do they shut down the postmaster during
backup, or do they live with the fact that maybe not every
backup is a vital one?

I sure hope they shut down the postmaster, or know that nothing is
happening during the backup.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#77Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ross J. Reedstrom (#69)
Re: Big 7.1 open items

"Ross J. Reedstrom" <reedstrm@rice.edu> writes:

On Thu, Jun 15, 2000 at 03:11:52AM -0400, Tom Lane wrote:

"Ross J. Reedstrom" <reedstrm@rice.edu> writes:

Any strong objections to the mixed relname_oid solution?

Yes!

The plan here was to let VACUUM handle renaming the file, since it
will already have all the necessary locks. This shortens the window
of confusion. ALTER TABLE RENAME doesn't happen that often, really -
the relname is there just for human consumption, then.

Yeah, I've seen tons of discussion of how if we do this, that, and
the other thing, and be prepared to fix up some other things in case
of crash recovery, we can make it work with filename == relname + OID
(where relname tracks logical name, at least at some remove).

Probably. Assuming nobody forgets anything.

I'm just trying to point out that that's a huge amount of pretty
delicate mechanism. The amount of work required to make it trustworthy
looks to me to dwarf the admin tools that Bruce is complaining about.
And we only have a few people competent to do the work. (With all
due respect, Ross, if you weren't already aware of the implications
for mdblindwrt, I have to wonder what else you missed.)

Filename == OID is so simple, reliable, and straightforward by
comparison that I think the decision is a no-brainer.

If we could afford to sink unlimited time into this one issue then
it might make sense to do it the hard way, but we have enough
important stuff on our TODO list to keep us all busy for years ---
I cannot believe that it's an effective use of our time to do this.

Hmm, what's all this with functions in catalog.c that are only called by
smgr/md.c? seems to me that anything having to do with physical storage
(like the path!) belongs in the smgr abstraction.

Yeah, there's a bunch of stuff that should have been implemented by
adding new smgr entry points, but wasn't. It should be pushed down.
(I can't resist pointing out that one of those things is physical
relation rename, which will go away and not *need* to be pushed down
if we do it the way I want.)

regards, tom lane

#78Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#70)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Gee, so dogmatic. No one besides Bruce and Hiroshi discussed this _at
all_ when I first put up patches two months ago. O.K., I'll do the
oids-only version (and fix up relpath_blind)

Hold on. I don't think we want that work done yet. Seems even Tom is
thinking that if Vadim is going to re-do everything later anyway, we may
be better with a relname/oid solution that does require additional
administration apps.

Don't put words in my mouth, please. If we are going to throw the
work away later, it'd be foolish to do the much greater amount of
work needed to make filename=relname+OID fly than is needed for
filename=OID.

However, I'm pretty sure I recall Vadim stating that he thought
filename=OID would be required for his smgr changes anyway...

regards, tom lane

#79Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#77)
RE: Big 7.1 open items

-----Original Message-----
From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On
Behalf Of Tom Lane

"Ross J. Reedstrom" <reedstrm@rice.edu> writes:

On Thu, Jun 15, 2000 at 03:11:52AM -0400, Tom Lane wrote:

"Ross J. Reedstrom" <reedstrm@rice.edu> writes:

Any strong objections to the mixed relname_oid solution?

Yes!

The plan here was to let VACUUM handle renaming the file, since it
will already have all the necessary locks. This shortens the window
of confusion. ALTER TABLE RENAME doesn't happen that often, really -
the relname is there just for human consumption, then.

Yeah, I've seen tons of discussion of how if we do this, that, and
the other thing, and be prepared to fix up some other things in case
of crash recovery, we can make it work with filename == relname + OID
(where relname tracks logical name, at least at some remove).

I've seen little discussion of how to avoid relying on a naming rule.
I've proposed many times that we should keep the information about
where a table is stored in the database itself. I've never seen
clear objections to it, so may I take it that my proposal is OK?
Isn't it much more important than the naming rule? Under that
mechanism, we could easily replace a bad naming rule.
And I believe that Ross's work is mostly about the mechanism,
not the naming rule.

Now I like neither relname nor oid because it's not sufficient
for my purpose.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#80Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#79)
Re: Big 7.1 open items

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Now I like neither relname nor oid because it's not sufficient
for my purpose.

We should probably not do much of anything with this issue until
we have a clearer understanding of what we want to do about
tablespaces and schemas.

My gut feeling is that we will end up with pathnames that look
something like

.../data/base/DBNAME/TABLESPACE/OIDOFRELATION

(with .N attached if a segment of a large relation, of course).

The TABLESPACE "name" should likely be an OID itself, but it wouldn't
have to be if you are willing to say that tablespaces aren't renamable.
(Come to think of it, does anyone care about being able to rename
databases? ;-)) Note that the TABLESPACE will often be a symlink
to storage on another drive, rather than a plain subdirectory of the
DBNAME, but that shouldn't be an issue at this level of discussion.

I think that schemas probably don't enter into this. We should instead
rely on the uniqueness of OIDs to prevent filename collisions. However,
OIDs aren't really unique: different databases in an installation will
use the same OIDs for their system tables. My feeling is that we can
live with a restriction like "you can't store the system tables of
different databases in the same tablespace". Alternatively we could
avoid that issue by inverting the pathname order:

.../data/base/TABLESPACE/DBNAME/OIDOFRELATION

Note that in any case, system tables will have to live in a
predetermined tablespace, since you can't very well look in pg_class
to find out which tablespace pg_class lives in. Perhaps we should
just reserve a tablespace per database for system tables and forget
the whole issue. If we do that, there's not really any need for
the database in the path! Just

.../data/base/TABLESPACE/OIDOFRELATION

would do fine and help reduce lookup overhead.

BTW, schemas do make things interesting for the other camp:
is it possible for the same table to be referenced by different
names in different schemas? If so, just how useful is it to pick
one of those names arbitrarily for the filename? This is an advanced
version of the main objection to using the original relname and not
updating it at RENAME TABLE --- sooner or later, the filenames are
going to be more confusing than helpful.

Comments? Have I missed something important about schemas?

regards, tom lane

#81Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#80)
Re: Big 7.1 open items

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Now I like neither relname nor oid because it's not sufficient
for my purpose.

We should probably not do much of anything with this issue until
we have a clearer understanding of what we want to do about
tablespaces and schemas.

Here is an analysis of our options:

                        Work required           Disadvantages
----------------------------------------------------------------------------

Keep current system     no work                 rename/create no rollback

relname/oid but         less work               new pg_class column,
no rename change                                filename not accurate
                                                on rename

relname/oid with        more work               complex code
rename change
during vacuum

oid filename            less work, but          confusing to admins
                        need admin tools

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#82Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#80)
RE: Big 7.1 open items

Sorry for my previous mail. It was posted by my mistake.

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Now I like neither relname nor oid because it's not sufficient
for my purpose.

We should probably not do much of anything with this issue until
we have a clearer understanding of what we want to do about
tablespaces and schemas.

My gut feeling is that we will end up with pathnames that look
something like

.../data/base/DBNAME/TABLESPACE/OIDOFRELATION

A schema is a logical concept, irrelevant to physical location.
I strongly object to your suggestion unless the above means the
*default* location.
A tablespace is an encapsulation of table allocation, and the
name should basically be irrelevant to the location. So the above
seems very bad to me.

Anyway, I don't see any advantage in a fixed-mapping implementation.
After the renewal, we should at least have the possibility to
allocate a specific table in an arbitrary separate directory.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#83Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Bruce Momjian (#81)
RE: Big 7.1 open items

-----Original Message-----
From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Now I like neither relname nor oid because it's not sufficient
for my purpose.

We should probably not do much of anything with this issue until
we have a clearer understanding of what we want to do about
tablespaces and schemas.

Here is an analysis of our options:

                        Work required           Disadvantages
----------------------------------------------------------------------------

Keep current system     no work                 rename/create no rollback

relname/oid but         less work               new pg_class column,
no rename change                                filename not accurate
                                                on rename

relname/oid with        more work               complex code
rename change
during vacuum

oid filename            less work, but          confusing to admins
                        need admin tools

Please add my opinion on the naming rule:

relname/unique_id but   need some work for      new pg_class column,
no relname change       unique-id generation    filename not relname

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#84Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#82)
Re: Big 7.1 open items

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

A tablespace is an encapsulation of table allocation, and the
name should basically be irrelevant to the location. So the above
seems very bad to me.
Anyway, I don't see any advantage in a fixed-mapping implementation.
After the renewal, we should at least have the possibility to
allocate a specific table in an arbitrary separate directory.

Call a "directory" a "tablespace" and we're on the same page,
aren't we? Actually I'd envision some kind of admin command
"CREATE TABLESPACE foo AS /path/to/wherever". That would make
appropriate system catalog entries and also create a symlink
from ".../data/base/foo" (or some such place) to the target
directory. Then when we make a table in that tablespace,
it's in the right place. Problem solved, no?

It gets a little trickier if you want to be able to split
multi-gig tables across several tablespaces, though, since
you couldn't just append ".N" to the base table path in that
scenario.

I'd be interested to know what sort of facilities Oracle
provides for managing huge tables...

regards, tom lane

#85Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#83)
Re: Big 7.1 open items

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Please add my opinion on the naming rule:

relname/unique_id but   need some work for      new pg_class column,
no relname change       unique-id generation    filename not relname

Why is a unique ID better than --- or even different from ---
using the relation's OID? It seems pointless to me...

regards, tom lane

#86Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#85)
RE: Big 7.1 open items

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Please add my opinion on the naming rule:

relname/unique_id but   need some work for      new pg_class column,
no relname change       unique-id generation    filename not relname

Why is a unique ID better than --- or even different from ---
using the relation's OID? It seems pointless to me...

For example, in the implementation of the CLUSTER command,
we would need a new file for the target relation in
order to put the sorted rows, but we don't want to change the
OID, do we? It would be needed for table re-construction generally.
If I remember correctly, you once proposed OID+version
naming for such cases.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#87Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#84)
RE: Big 7.1 open items

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Tablespace is an encapsulation of table allocation and the
name should be irrevant to the location basically. So above
seems very bad for me.
Anyway I don't see any advantage in fixed mapping impleme
ntation. After renewal,we should at least have a possibility to
allocate a specific table in arbitrary separate directory.

Call a "directory" a "tablespace" and we're on the same page,
aren't we? Actually I'd envision some kind of admin command
"CREATE TABLESPACE foo AS /path/to/wherever".

Yes, I think 'tablespace -> directory' is the most natural
extension under the current file_per_table storage manager.
If a many_tables_in_a_file storage manager is introduced, we
may be able to change the definition of TABLESPACE
to 'tablespace -> files', like Oracle.

That would make
appropriate system catalog entries and also create a symlink
from ".../data/base/foo" (or some such place) to the target
directory.
Then when we make a table in that tablespace,
it's in the right place. Problem solved, no?

I don't like symlinks for DBMS data files. However, it may
be OK if symlinks are limited to the 'tablespace->directory'
correspondence and all tablespaces (including the default,
etc.) are symlinks. That is simple, and all debugging would
happen under the tablespace_is_symlink environment.

It gets a little trickier if you want to be able to split
multi-gig tables across several tablespaces, though, since
you couldn't just append ".N" to the base table path in that
scenario.

This doesn't seem easy to solve right now.
Ross doesn't change the naming rule for multi-gig
tables in his trial either.

I'd be interested to know what sort of facilities Oracle
provides for managing huge tables...

From what I know of older Oracle, one TABLESPACE
could have many DATAFILEs, which in turn could contain
many tables.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#88Chris Bitmead
chrisb@nimrod.itg.telstra.com.au
In reply to: Noname (#33)
Re: Big 7.1 open items

Tom Lane wrote:

<dbroot>/catalog_tables/pg_...
<dbroot>/catalog_index/pg_...
<dbroot>/user_tables/oid_...
<dbroot>/user_index/oid_...
<dbroot>/temp_tables/oid_...
<dbroot>/temp_index/oid_...
<dbroot>/toast_tables/oid_...
<dbroot>/toast_index/oid_...
<dbroot>/whatnot_???/...

I don't see a lot of value in that. Better to do something like
tablespaces:

<dbroot>/<oidoftablespace>/<oidofobject>

What is the benefit of having oidoftablespace in the directory path?
Isn't tablespace an idea so you can store it somewhere completely
different?
Or is there some symlink idea or something?

#89Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#86)
Re: Big 7.1 open items

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Why is a unique ID better than --- or even different from ---
using the relation's OID? It seems pointless to me...

For example,in the implementation of CLUSTER command,
we would need another new file for the target relation in
order to put sorted rows but don't we want to change the
OID ? It would be needed for table re-construction generally.
If I remember correectly,you once proposed OID+version
naming for the cases.

Hmm, so you are thinking that the pg_class row for the table would
include this uniqueID, and then committing the pg_class update would
be the atomic action that replaces the old table contents with the
new? It does have some attraction now that I think about it.

But there are other ways we could do the same thing. If we want to
have tablespaces, there will need to be a tablespace identifier in
each pg_class row. So we could do CLUSTER in the same way as we'd
move a table from one tablespace to another: create the new files in
the new tablespace directory, and the commit of the new pg_class row
with the new tablespace value is the atomic action that makes the new
files valid and the old files not.

You will probably say "but I didn't want to move my table to a new
tablespace just to cluster it!" I think we could live with that,
though. A tablespace doesn't need to have any existence more concrete
than a subdirectory, in my vision of the way things would work. We
could do something like making two subdirectories of each place that
the dbadmin designates as a "tablespace", so that we make two logical
tablespaces out of what the dbadmin thinks of as one. Then we can
ping-pong between those directories to do things like clustering "in
place".

Basically I want to keep the bottom-level mechanisms as simple and
reliable as we possibly can. The fewer concepts are known down at
the bottom, the better. If we can keep the pathname constituents
to just "tablespace" and "relation OID" we'll be in great shape ---
but each additional concept that has to be known down there is
another potential problem.

regards, tom lane

#90Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#89)
RE: Big 7.1 open items

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Why is a unique ID better than --- or even different from ---
using the relation's OID? It seems pointless to me...

For example,in the implementation of CLUSTER command,
we would need another new file for the target relation in
order to put sorted rows but don't we want to change the
OID ? It would be needed for table re-construction generally.
If I remember correectly,you once proposed OID+version
naming for the cases.

Hmm, so you are thinking that the pg_class row for the table would
include this uniqueID,

No, I would just include the place where the table is stored (a
pathname, under the current file_per_table storage manager) in the
pg_class row, because I don't want to rely on a table-allocating rule
(currently a naming rule) to access existing relation files. This has
always been my main point.
A many_tables_in_a_file storage manager wouldn't be able to live
without keeping this kind of information.
This information (where it is stored) is different from the
tablespace (where to store) information. There was an idea to keep
the information in an opaque entry in pg_class that only a specific
storage manager could handle; there was an idea to have a new system
table that keeps the information; and so on...

and then committing the pg_class update would
be the atomic action that replaces the old table contents with the
new? It does have some attraction now that I think about it.

But there are other ways we could do the same thing. If we want to
have tablespaces, there will need to be a tablespace identifier in
each pg_class row. So we could do CLUSTER in the same way as we'd
move a table from one tablespace to another: create the new files in
the new tablespace directory, and the commit of the new pg_class row
with the new tablespace value is the atomic action that makes the new
files valid and the old files not.

You will probably say "but I didn't want to move my table to a new
tablespace just to cluster it!"

Yes.

I think we could live with that,
though. A tablespace doesn't need to have any existence more concrete
than a subdirectory, in my vision of the way things would work. We
could do something like making two subdirectories of each place that
the dbadmin designates as a "tablespace", so that we make two logical
tablespaces out of what the dbadmin thinks of as one.

Certainly we could design TABLESPACE(where to store) as above.

Then we can
ping-pong between those directories to do things like clustering "in
place".

But maybe we must keep the directory information where the table was
*ping-ponged* in (e.g.) pg_class. Is such an implementation cleaner or
more extensible than mine (keeping the stored place exactly)?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#91Tom Lane
tgl@sss.pgh.pa.us
In reply to: Chris Bitmead (#88)
Re: Big 7.1 open items

Chris Bitmead <chrisb@nimrod.itg.telstra.com.au> writes:

Tom Lane wrote:

I don't see a lot of value in that. Better to do something like
tablespaces:

<dbroot>/<oidoftablespace>/<oidofobject>

What is the benefit of having oidoftablespace in the directory path?
Isn't tablespace an idea so you can store it somewhere completely
different?
Or is there some symlink idea or something?

Exactly --- I'm assuming that the tablespace "directory" is likely
to be a symlink to some other mounted volume. The point here is
to keep the low-level file access routines from having to know very
much about tablespaces or file organization. In the above proposal,
all they need to know is the relation's OID and the name (or OID)
of the tablespace the relation's assigned to; then they can form
a valid path using a hardwired rule. There's still plenty of
flexibility of organization, but it's not necessary to know that
where the rubber meets the road (eg, when you're down inside mdblindwrt
trying to dump a dirty buffer to disk with no spare resources to find
out anything about the relation the page belongs to...)

regards, tom lane

#92Noname
JanWieck@t-online.de
In reply to: Tom Lane (#84)
Re: Big 7.1 open items

Tom Lane wrote:

It gets a little trickier if you want to be able to split
multi-gig tables across several tablespaces, though, since
you couldn't just append ".N" to the base table path in that
scenario.

I'd be interested to know what sort of facilities Oracle
provides for managing huge tables...

Oracle tablespaces are a collection of 1...n preallocated
files. Each table then is bound to a tablespace and
allocates extents (chunks) from those files.

There are some per-table attributes that control the extent
sizes, with default values coming from the tablespace: the
initial extent size, the nextextent and the pctincrease.
There is a hardcoded limit on the number of extents a table
can have at all. In Oracle7 it was 512 (or somewhat below -
I don't recall exactly). Maybe that's gone with Oracle8, I
don't know.

This storage concept has IMHO a couple of advantages over
ours.

The tablespace files are preallocated, so there will
never be a change in block allocation during runtime, and
that's the basis for fdatasync() being sufficient at
syncpoints. All that might be inaccurate after a crash is
the last-modified time in the inode, and that's totally
irrelevant for Oracle. The fsck will never fail, and
everything else is up to Oracle's recovery.

The number of total tablespace files is limited to a
value that ensures that the backends can keep them all
open all the time. It's hard to exceed that limit. A
typical SAP installation with more than 20,000
tables/indices doesn't need more than 30 or 40 of them.

It is perfectly prepared for raw devices, since a
tablespace in a raw device installation is simply an area
of blocks on a disk.

There are also disadvantages.

You can run out of space even if there are plenty GB's
free on your disks. You have to create tablespaces
explicitly.

If you've chosen inadequate extent-size parameters, you
end up with highly fragmented tables (slowing things down)
or get stuck running against maxextents, where only a
reorg (export/import) helps.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#93Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Thomas Lockhart (#67)
Re: Re: Big 7.1 open items

o Don't accept character sequences that are not valid in their
charset (signaling ERROR seems appropriate IMHO)
o Make PostgreSQL more multibyte aware (for example, TRIM function and
NAME data type)
o Regard n of CHAR(n)/VARCHAR(n) as the number of letters, rather than
the number of bytes

All good, and important features when we are done.

Glad to hear that.

One issue: I can see (or imagine ;) how we can use the Postgres type
system to manage multiple character sets. But allowing arbitrary
character sets in, say, table names forces us to cope with allowing a
mix of character sets in a single column of a system table. afaik this
general capability is not mandated by SQL9x (the SQL_TEXT character set
is used for all system resources??). Would it be acceptable to have a
"default database character set" which is allowed to creep into the
pg_xxx tables? Even that seems to be a difficult thing to accomplish at
the moment (we'd need to get some of the text manipulation functions
from the catalogs, not from hardcoded references as we do now).

The "default database character set" idea does not seem to be the solution
for cross-db relations such as pg_database. The only solution I can
imagine so far is using SQL_TEXT.

BTW, I've been thinking about SQL_TEXT for a while and it seems
mule_internal_code or Unicode(UTF-8) would be the candidates for
it. Mule_internal_code looks more acceptable for Asian multi-byte
users like me than Unicode. It's clean, simple and does not require
huge conversion tables between Unicode and other encodings. However,
Unicode has a stronger political power in the real world and for most
single-byte users probably it would be enough. My idea is let users
choose one of them. I mean making it a compile time option.

We should itemize all of these issues so we can keep track of what is
necessary, possible, and/or "easy".

You are right, probably there would be tons of issues in implementing
multiple charsets support.
--
Tatsuo Ishii

#94Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Michael Robinson (#27)
Re: Re: Big 7.1 open items

The "default database character set" idea does not seem to be the solution
for cross-db relations such as pg_database. The only solution I can
imagine so far is using SQL_TEXT.
BTW, I've been thinking about SQL_TEXT for a while and it seems
mule_internal_code or Unicode(UTF-8) would be the candidates for
it. Mule_internal_code looks more acceptable for Asian multi-byte
users like me than Unicode. It's clean, simple and does not require
huge conversion tables between Unicode and other encodings. However,
Unicode has a stronger political power in the real world and for most
single-byte users probably it would be enough. My idea is let users
choose one of them. I mean making it a compile time option.

Oh. I was recalling SQL_TEXT as being a "subset" character set which
contains only the characters (more or less) that are required for
implementing the SQL92 query language and standard features.

Are you seeing it as being a "superset" character set which can
represent all other character sets??

And, how would you suggest we start tracking this discussion in a design
document? I could put something into the developer's guide, or we could
have a plain-text FAQ, or ??

I'd propose that we start accumulating a feature list, perhaps ordering
it into categories like

o required/suggested by SQL9x
o required/suggested by experience in the real world
o sure would be nice to have
o really bad idea ;)

- Thomas

#95Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#92)
Re: Big 7.1 open items

JanWieck@t-online.de (Jan Wieck) writes:

There are also disadvantages.

You can run out of space even if there are plenty GB's
free on your disks. You have to create tablespaces
explicitly.

Not to mention the reverse: if I read this right, you have to suck
up your GB's long in advance of actually needing them. That's OK
for a machine that's dedicated to Oracle ... not so OK for smaller
installations, playpens, etc.

I'm not convinced that there's anything fundamentally wrong with
doing storage allocation in Unix files the way we have been.

(At least not when we're sitting atop a well-done filesystem,
which may leave the Linux folk out in the cold ;-).)

regards, tom lane

#96Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Noname (#92)
Re: Big 7.1 open items

(At least not when we're sitting atop a well-done filesystem,
which may leave the Linux folk out in the cold ;-).)

Those who live in HP houses should not throw stones :))

- Thomas

#97Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#92)
Re: Big 7.1 open items

JanWieck@t-online.de (Jan Wieck) writes:

Tom Lane wrote:

It gets a little trickier if you want to be able to split
multi-gig tables across several tablespaces, though, since
you couldn't just append ".N" to the base table path in that
scenario.

I'd be interested to know what sort of facilities Oracle
provides for managing huge tables...

Oracle tablespaces are a collection of 1...n preallocated
files. Each table then is bound to a tablespace and
allocates extents (chunks) from those files.

OK, to get back to the point here: so in Oracle, tables can't cross
tablespace boundaries, but a tablespace itself could span multiple
disks?

Not sure if I like that better or worse than equating a tablespace
with a directory (so, presumably, all the files within it live on
one filesystem) and then trying to make tables able to span
tablespaces. We will need to do one or the other though, if we want
to have any significant improvement over the current state of affairs
for large tables.

One way is to play the flip-the-path-ordering game some more,
and access multiple-segment tables with pathnames like this:

.../TABLESPACE/RELATION -- first or only segment
.../TABLESPACE/N/RELATION -- N'th extension segment

This isn't any harder for md.c to deal with than what we do now,
but by making the /N subdirectories be symlinks, the dbadmin could
easily arrange for extension segments to go on different filesystems.
Also, since /N subdirectory symlinks can be added as needed,
expanding available space by attaching more disks isn't hard.
(If the admin hasn't pre-made a /N symlink when it's needed,
I'd envision the backend just automatically creating a plain
subdirectory so that it can extend the table.)

A limitation is that the N'th extension segments of all the relations
in a given tablespace have to be in the same place, but I don't see
that as a major objection. Worst case is you make a separate tablespace
for each of your multi-gig relations ... you're probably not going to
have a very large number of such relations, so this doesn't seem like
unmanageable admin complexity.

We'd still want to create some tools to help the dbadmin with slinging
all these symlinks around, of course. But I think it's critical to keep
the low-level file access protocol simple and reliable, which really
means minimizing the amount of information the backend needs to know to
figure out which file to write a page in. With something like the above
you only need to know the tablespace name (or more likely OID), the
relation OID (+name or not, depending on outcome of other argument),
and the offset in the table. No worse than now from the software's
point of view.

Comments?

regards, tom lane

#98Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Noname (#92)
Re: Big 7.1 open items

... But I think it's critical to keep
the low-level file access protocol simple and reliable, which really
means minimizing the amount of information the backend needs to know
to figure out which file to write a page in. With something like the
above you only need to know the tablespace name (or more likely OID),
the relation OID (+name or not, depending on outcome of other
argument), and the offset in the table. No worse than now from the
software's point of view.
Comments?

I'm probably missing the context a bit, but imho we should try hard to
stay away from symlinks as the general solution for anything.

Sorry for being behind here, but to make sure I'm on the right page:
o tablespaces decouple storage from logical tables
o a database lives in a default tablespace, unless specified
o by default, a table will live in the default tablespace
o (eventually) a table can be split across tablespaces

Some thoughts:
o the ability to split single tables across disks was essential for
scalability when disks were small. But with RAID, NAS, etc etc isn't
that a smaller issue now?
o "tablespaces" would implement our less-developed "with location"
feature, right? Splitting databases, whole indices and whole tables
across storage is the biggest win for this work since more users will
use the feature.
o location information needs to travel with individual tables anyway.

#99Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Noname (#92)
Re: Big 7.1 open items

There are also disadvantages.

You can run out of space even if there are plenty GB's
free on your disks. You have to create tablespaces
explicitly.

If you've chosen inadequate extent-size parameters, you
end up with highly fragmented tables (slowing things down)
or get stuck running against maxextents, where only a
reorg (export/import) helps.

Also, Tom Lane pointed out to me that file system read-ahead does not
help if your table is spread around in tablespaces.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#100The Hermit Hacker
scrappy@hub.org
In reply to: Bruce Momjian (#81)
Re: Big 7.1 open items

On Thu, 15 Jun 2000, Bruce Momjian wrote:

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Now I like neither relname nor oid because it's not sufficient
for my purpose.

We should probably not do much of anything with this issue until
we have a clearer understanding of what we want to do about
tablespaces and schemas.

Here is an analysis of our options:

                          Work required          Disadvantages
----------------------------------------------------------------------------
Keep current system       no work                rename/create no rollback

relname/oid but           less work              new pg_class column,
no rename change                                 filename not accurate on
                                                 rename

relname/oid with          more work              complex code
rename change during
vacuum

oid filename              less work, but         confusing to admins
                          need admin tools

My vote is with Tom on this one ... oid only ... the admin should be able
to do a quick SELECT on a table to find out the OID->table mapping, and I
believe it's already been pointed out that you can't just restore one file
anyway, so it kinda negates the "server isn't running" problem ...

#101The Hermit Hacker
scrappy@hub.org
In reply to: Tom Lane (#85)
Re: Big 7.1 open items

On Thu, 15 Jun 2000, Tom Lane wrote:

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Please add my opinion for naming rule.

relname/unique_id but     need some work              new pg_class column,
no relname change         for unique-id generation    filename not relname

Why is a unique ID better than --- or even different from ---
using the relation's OID? It seems pointless to me...

just to open up a whole new bucket of worms here, but ... if we do use OID
(which up until this thought I endorse 100%) ... do we not run a risk if
we run out of OIDs? As far as I know, those are still a finite resource,
no?

or, do we just assume that by the time that comes, everyone will be pretty
much using 64bit machines? :)

#102Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Lockhart (#98)
Re: Big 7.1 open items

Thomas Lockhart <lockhart@alumni.caltech.edu> writes:

... But I think it's critical to keep
the low-level file access protocol simple and reliable, which really
means minimizing the amount of information the backend needs to know
to figure out which file to write a page in. With something like the
above you only need to know the tablespace name (or more likely OID),
the relation OID (+name or not, depending on outcome of other
argument), and the offset in the table. No worse than now from the
software's point of view.
Comments?

I'm probably missing the context a bit, but imho we should try hard to
stay away from symlinks as the general solution for anything.

Why?

regards, tom lane

#103The Hermit Hacker
scrappy@hub.org
In reply to: Bruce Momjian (#76)
Re: AW: Big 7.1 open items

On Thu, 15 Jun 2000, Bruce Momjian wrote:

I'm just curious - when was the last time you restored a Postgres
database in this piecemeal manner, and how often do you do it?

More curious to me is that people seem to use physical file-
based backup at all. Do they shutdown the postmaster during
backup or do they live with the fact that maybe not every
backup is a vital one?

I sure hope they shut down the postmaster, or know that nothing is
happening during the backup.

I do a backup based on a pg_dump snapshot at the time of the backup
...

#104Tom Lane
tgl@sss.pgh.pa.us
In reply to: The Hermit Hacker (#101)
Re: Big 7.1 open items

The Hermit Hacker <scrappy@hub.org> writes:

just to open up a whole new bucket of worms here, but ... if we do use OID
(which up until this thought I endorse 100%) ... do we not run a risk if
we run out of OIDs? As far as I know, those are still a finite resource,
no?

They are, and there is some risk involved, but OID collisions in the
system tables will cause you just as much headache. There's not only
the pg_class row to think of, but the pg_attribute rows, etc etc.

If you did have an OID collision with an existing table you'd have to
keep trying until you got a set of OID assignments with no conflicts.
(Now that we have unique indexes on the system tables, this should
work properly, ie, you will hear about it if you have a conflict.)
I don't think the physical table names make this noticeably worse.
Of course we'd better be careful to create table files with O_EXCL,
so as not to tromp on existing files, but we do that already IIRC.

or, do we just assume that by the time that comes, everyone will be pretty
much using 64bit machines? :)

I think we are not too far away from being able to offer 64-bit OID as
a compile-time option (on machines where there is a 64-bit integer type
that is). It's just a matter of someone putting it at the head of their
todo list.

Bottom line is I'm not real worried about this issue.

But having said all that, I am coming round to agree with Hiroshi's idea
anyway. See upcoming message.

regards, tom lane

#105Don Baccus
dhogaza@pacifier.com
In reply to: Tom Lane (#97)
Re: Big 7.1 open items

At 11:46 AM 6/16/00 -0400, Tom Lane wrote:

OK, to get back to the point here: so in Oracle, tables can't cross
tablespace boundaries,

Right, the construct AFAIK is "create table/index foo on tablespace ..."

but a tablespace itself could span multiple
disks?

Right.

Not sure if I like that better or worse than equating a tablespace
with a directory (so, presumably, all the files within it live on
one filesystem) and then trying to make tables able to span
tablespaces. We will need to do one or the other though, if we want
to have any significant improvement over the current state of affairs
for large tables.

Oracle's way does a reasonable job of isolating the datamodel
from the details of the physical layout.

Take the OpenACS web toolkit, for instance. We could take
each module's tables and indices and assign them appropriately
to various dataspaces, then provide a separate .sql files with
only "create tablespace" statements in there.

By modifying that one central file, the toolkit installation
could be customized to run anything from a small site (one
disk with everything on it, ala my own personal webserver at
birdnotes.net) or a very large site with many spindles, with
various index and table structures spread out widely hither
and thither.

Given that the OpenACS datamodel is nearly 10K lines long (including
many comments, of course), being able to customize an installation
to such a degree by modifying a single file filled with "create
tablespaces" would be very attractive.

One way is to play the flip-the-path-ordering game some more,
and access multiple-segment tables with pathnames like this:

.../TABLESPACE/RELATION -- first or only segment
.../TABLESPACE/N/RELATION -- N'th extension segment

This isn't any harder for md.c to deal with than what we do now,
but by making the /N subdirectories be symlinks, the dbadmin could
easily arrange for extension segments to go on different filesystems.

I personally dislike depending on symlinks to move stuff around.
Among other things, a pg_dump/restore (and presumably future
backup tools?) can't recreate the disk layout automatically.

We'd still want to create some tools to help the dbadmin with slinging
all these symlinks around, of course.

OK, if symlinks are simply an implementation detail hidden from the
dbadmin, and if the physical structure is kept in the db so it can
be rebuilt if necessary automatically, then I don't mind symlinks.

But I think it's critical to keep
the low-level file access protocol simple and reliable, which really
means minimizing the amount of information the backend needs to know to
figure out which file to write a page in. With something like the above
you only need to know the tablespace name (or more likely OID), the
relation OID (+name or not, depending on outcome of other argument),
and the offset in the table. No worse than now from the software's
point of view.

Make the code that creates and otherwise manipulates tablespaces
do the work, while keeping the low-level file access protocol simple.

Yes, this approach sounds very good to me.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.

#106Don Baccus
dhogaza@pacifier.com
In reply to: Thomas Lockhart (#98)
Re: Big 7.1 open items

At 04:27 PM 6/16/00 +0000, Thomas Lockhart wrote:

Sorry for being behind here, but to make sure I'm on the right page:
o tablespaces decouple storage from logical tables
o a database lives in a default tablespace, unless specified
o by default, a table will live in the default tablespace
o (eventually) a table can be split across tablespaces

Or tablespaces across filesystems/mountpoints whatever.

Some thoughts:
o the ability to split single tables across disks was essential for
scalability when disks were small. But with RAID, NAS, etc etc isn't
that a smaller issue now?

Yes for size issues, I should think, especially if you have the
money for a large RAID subsystem. But for throughput performance,
control over which spindles particularly busy tables and indices
go on would still seem to be pretty relevant, when they're being
updated a lot. In order to minimize seek times.

I really can't say how important this is in reality. Oracle-world
folks still talk about this kind of optimization being important,
but I'm not personally running any kind of database-backed website
that's busy enough or contains enough storage to worry about it.

o "tablespaces" would implement our less-developed "with location"
feature, right? Splitting databases, whole indices and whole tables
across storage is the biggest win for this work since more users will
use the feature.
o location information needs to travel with individual tables anyway.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.

#107Tom Lane
tgl@sss.pgh.pa.us
In reply to: Don Baccus (#105)
Re: Big 7.1 open items

Don Baccus <dhogaza@pacifier.com> writes:

This isn't any harder for md.c to deal with than what we do now,
but by making the /N subdirectories be symlinks, the dbadmin could
easily arrange for extension segments to go on different filesystems.

I personally dislike depending on symlinks to move stuff around.
Among other things, a pg_dump/restore (and presumably future
backup tools?) can't recreate the disk layout automatically.

Good point, we'd need some way of saving/restoring the tablespace
structures.

We'd still want to create some tools to help the dbadmin with slinging
all these symlinks around, of course.

OK, if symlinks are simply an implementation detail hidden from the
dbadmin, and if the physical structure is kept in the db so it can
be rebuilt if necessary automatically, then I don't mind symlinks.

I'm not sure about keeping it in the db --- creates a bit of a
chicken-and-egg problem doesn't it? Maybe there needs to be a
"system database" that has nailed-down pathnames (no tablespaces
for you baby) and contains the critical installation-wide tables
like pg_database, pg_user, pg_tablespace. A restore would have
to restore these tables first anyway.

Make the code that creates and otherwise manipulates tablespaces
do the work, while keeping the low-level file access protocol simple.

Right, that's the bottom line for me.

regards, tom lane

#108Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Don Baccus (#106)
Re: Big 7.1 open items

Some thoughts:
o the ability to split single tables across disks was essential for
scalability when disks were small. But with RAID, NAS, etc etc isn't
that a smaller issue now?

Yes for size issues, I should think, especially if you have the
money for a large RAID subsystem. But for throughput performance,
control over which spindles particularly busy tables and indices
go on would still seem to be pretty relevant, when they're being
updated a lot. In order to minimize seek times.

I really can't say how important this is in reality. Oracle-world
folks still talk about this kind of optimization being important,
but I'm not personally running any kind of database-backed website
that's busy enough or contains enough storage to worry about it.

It is important when you have a few big tables that must be fast. One
objection I have always had to the HP logical volume manager is that it
is difficult to know what drives are being assigned to each logical
volume.

Seems if they don't have RAID, we should allow such drive partitioning.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#109Ross J. Reedstrom
reedstrm@rice.edu
In reply to: Thomas Lockhart (#98)
Re: Big 7.1 open items

On Fri, Jun 16, 2000 at 04:27:22PM +0000, Thomas Lockhart wrote:

... But I think it's critical to keep
the low-level file access protocol simple and reliable, which really
means minimizing the amount of information the backend needs to know
to figure out which file to write a page in. With something like the
above you only need to know the tablespace name (or more likely OID),
the relation OID (+name or not, depending on outcome of other
argument), and the offset in the table. No worse than now from the
software's point of view.
Comments?

I think the backend needs a per table token that indicates how
to get at the physical bits of the file. Whether that's a filename
alone, filename with path, oid, key to a smgr hash table or something
else, it's opaque above the smgr routines.

Hmm, now I'm thinking, since the tablespace discussion has been reopened,
the way to go about coding all this is to reactivate the smgr code: how
about I leave the existing md smgr as is, and clone it, call it md2 or
something, and start messing with adding features there?

I'm probably missing the context a bit, but imho we should try hard to
stay away from symlinks as the general solution for anything.

Sorry for being behind here, but to make sure I'm on the right page:
o tablespaces decouple storage from logical tables
o a database lives in a default tablespace, unless specified
o by default, a table will live in the default tablespace
o (eventually) a table can be split across tablespaces

Some thoughts:
o the ability to split single tables across disks was essential for
scalability when disks were small. But with RAID, NAS, etc etc isn't
that a smaller issue now?
o "tablespaces" would implement our less-developed "with location"
feature, right? Splitting databases, whole indices and whole tables
across storage is the biggest win for this work since more users will
use the feature.
o location information needs to travel with individual tables anyway.

I was just thinking that discussion needed some summation.

Some links to historic discussion:

This one is Vadim saying WAL will need oid-based names:
http://www.postgresql.org/mhonarc/pgsql-hackers/1999-11/msg00809.html

A longer discussion kicked off by Don Baccus:
http://www.postgresql.org/mhonarc/pgsql-hackers/2000-01/msg00510.html

Tom suggesting OIDs to allow rollback:
http://www.postgresql.org/mhonarc/pgsql-hackers/2000-03/msg00119.html

Martin Neumann posted a question on dataspaces:

(can't find it in the official archives: looks like March 2000, 10-29 is
missing. Here's my copy: don't beat on it! In particular, since I threw
it together for local access, it's one _big_ index page)

http://cooker.ir.rice.edu/postgresql/msg20257.html
(in that thread is a post where I mention blindwrites and getting rid
of GetRawDatabaseInfo)

Martin later posted an RFD on tablespaces:

http://cooker.ir.rice.edu/postgresql/msg20490.html

Here's Horák Daniel with a patch for discussion, implementing dataspaces
on a per-database level:

http://cooker.ir.rice.edu/postgresql/msg20498.html

Ross
--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005

#110Don Baccus
dhogaza@pacifier.com
In reply to: Tom Lane (#107)
Re: Big 7.1 open items

At 03:00 PM 6/16/00 -0400, Tom Lane wrote:

OK, if symlinks are simply an implementation detail hidden from the
dbadmin, and if the physical structure is kept in the db so it can
be rebuilt if necessary automatically, then I don't mind symlinks.

I'm not sure about keeping it in the db --- creates a bit of a
chicken-and-egg problem doesn't it?

Not if creating the tablespaces precedes creating the tables stored in them.

Maybe there needs to be a
"system database" that has nailed-down pathnames (no tablespaces
for you baby) and contains the critical installation-wide tables
like pg_database, pg_user, pg_tablespace. A restore would have
to restore these tables first anyway.

Oh, I see. Yes, when I've looked into this and have thought about
it I've assumed that there would always be a known starting point
which would contain the installation-wide tables.

From a practical point of view, I don't think that's really a
problem.

I've not looked into how Oracle does this, I assume it builds
a system tablespace on one of the initial mount points you give
it when you install the thing. The paths to the mount points
are stored in specific files known to Oracle, I think. It's
been over a year (not long enough!) since I've set up Oracle...

- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.

#111Ross J. Reedstrom
reedstrm@rice.edu
In reply to: Tom Lane (#77)
Re: Big 7.1 open items

On Thu, Jun 15, 2000 at 07:53:52PM -0400, Tom Lane wrote:

"Ross J. Reedstrom" <reedstrm@rice.edu> writes:

On Thu, Jun 15, 2000 at 03:11:52AM -0400, Tom Lane wrote:

"Ross J. Reedstrom" <reedstrm@rice.edu> writes:

Any strong objections to the mixed relname_oid solution?

Yes!

The plan here was to let VACUUM handle renaming the file, since it
will already have all the necessary locks. This shortens the window
of confusion. ALTER TABLE RENAME doesn't happen that often, really -
the relname is there just for human consumption, then.

Yeah, I've seen tons of discussion of how if we do this, that, and
the other thing, and be prepared to fix up some other things in case
of crash recovery, we can make it work with filename == relname + OID
(where relname tracks logical name, at least at some remove).

Probably. Assuming nobody forgets anything.

I agree, it seems a major undertaking, at first glance. And second. Even
third. Especially for someone who hasn't 'earned his spurs' yet, as
it were.

I'm just trying to point out that that's a huge amount of pretty
delicate mechanism. The amount of work required to make it trustworthy
looks to me to dwarf the admin tools that Bruce is complaining about.
And we only have a few people competent to do the work. (With all
due respect, Ross, if you weren't already aware of the implications
for mdblindwrt, I have to wonder what else you missed.)

Ah, you knew that comment would come back to haunt me (I have a
tendency to think out loud, even if checking and coming back later
would be better;-) In fact, there's no problem, and never was, since the
buffer->blind.relname is filled in via RelationGetPhysicalRelationName,
just like every other path that requires direct file access. I just
didn't remember that I had in fact checked it (it's been a couple months,
and I just got back from vacation ;-)

Actually, once I re-checked it, the code looked very familiar. I had
spent time looking at the blind write code in the context of getting
rid of the only non-startup use of GetRawDatabaseInfo.

As to missing things: I'm leaning heavily on Bruce's previous
work for temp tables, to separate the two uses of relname, via the
RelationGetRelationName and RelationGetPhysicalRelationName. There are
102 uses of the first in the current code (many in elog messages), and
only 11 of the second. If I'd had to do the original work of finding
every use of relname, and categorizing it, I agree I'm not (yet) up to
it, but I have more confidence in Bruce's (already tested) work.

Filename == OID is so simple, reliable, and straightforward by
comparison that I think the decision is a no-brainer.

Perhaps. Changing the label of the file on disk still requires finding
all the code that assumes it knows what that name is, and changing it.
Same work.

If we could afford to sink unlimited time into this one issue then
it might make sense to do it the hard way, but we have enough
important stuff on our TODO list to keep us all busy for years ---
I cannot believe that it's an effective use of our time to do this.

The joys of Open Development. You've spent a fair amount of time trying
to convince _me_ not to waste my time. Thanks, but I'm pretty bull headed
sometimes. Since I've already done some of the work, take a look
at what I've got, and then tell me I'm wasting my time, o.k.?

Hmm, what's all this with functions in catalog.c that are only called by
smgr/md.c? seems to me that anything having to do with physical storage
(like the path!) belongs in the smgr abstraction.

Yeah, there's a bunch of stuff that should have been implemented by
adding new smgr entry points, but wasn't. It should be pushed down.
(I can't resist pointing out that one of those things is physical
relation rename, which will go away and not *need* to be pushed down
if we do it the way I want.)

Oh, I agree completely. In fact, as I said to Hiroshi last time this came
up, I think of the field in pg_class as an opaque token, to be filled in
by the smgr, and only used by code further up to hand back to the smgr
routines. Same should be true of the buffer->blind struct.

Ross
--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005

#112Kaare Rasmussen
kar@webline.dk
In reply to: Tom Lane (#95)
Re: Big 7.1 open items

(At least not when we're sitting atop a well-done filesystem,
which may leave the Linux folk out in the cold ;-).)

Exactly what fs of Linux are you talking about? I believe that for a database
server, ReiserFS would be a natural choice.

--
Kaare Rasmussen --Linux, spil,-- Tlf: 3816 2582
Kaki Data tshirts, merchandize Fax: 3816 2582
Howitzvej 75 Åben 14.00-18.00 Email: kar@webline.dk
2000 Frederiksberg Lørdag 11.00-17.00 Web: www.suse.dk

#113Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#95)
RE: Big 7.1 open items

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

JanWieck@t-online.de (Jan Wieck) writes:

There are also disadvantages.

You can run out of space even if there are plenty GB's
free on your disks. You have to create tablespaces
explicitly.

Not to mention the reverse: if I read this right, you have to suck
up your GB's long in advance of actually needing them. That's OK
for a machine that's dedicated to Oracle ... not so OK for smaller
installations, playpens, etc.

I've had misgivings about Oracle-style preallocation.
It has not been easy for me to estimate the extent size in
Oracle. Maybe it would lose the simplicity of environment
settings, which is one of the biggest advantages of PostgreSQL.
It seems that we should also provide not_preallocated DATAFILE
when many_tables_in_a_file storage manager is introduced.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#114Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ross J. Reedstrom (#109)
Re: Big 7.1 open items

"Ross J. Reedstrom" <reedstrm@rice.edu> writes:

I think the backend needs a per table token that indicates how
to get at the physical bits of the file. Whether that's a filename
alone, filename with path, oid, key to a smgr hash table or something
else, it's opaque above the smgr routines.

Except to the commands that provide the user interface for tablespaces
and so forth. And there aren't all that many places that deal with
physical filenames anyway. It would be a good idea to try to be a
little stricter about this, but I'm not sure you can make the separation
a whole lot cleaner than it is now ... with the exception of the obvious
bogosities like "rename table" being done above the smgr level. (But,
as I said, I want to see that code go away, not just get moved into
smgr...)

Hmm, now I'm thinking, since the tablespace discussion has been reopened,
the way to go about coding all this is to reactivate the smgr code: how
about I leave the existing md smgr as is, and clone it, call it md2 or
something, and start messing with adding features there?

Um, well, you can't have it both ways. If you're going to change/fix
the assumptions of code above the smgr, then you've got to update md
at the same time to match your new definition of the smgr interface.
Won't do much good to have a playpen smgr if the "standard" one is
broken.

One thing I have been thinking would be a good idea is to take the
relcache out of the bufmgr/smgr interfaces. The relcache is a
higher-level concept and ought not be known to bufmgr or smgr; they
ought to work with some low-level data structure or token for relations.
We might be able to eliminate the whole concept of "blind write" if we
do that. There are other problems with the relcache dependency: entries
in relcache can get blown away at inopportune times due to shared cache
inval, and it doesn't provide a good home for tokens for multiple
"versions" of a relation if we go with the fill-a-new-physical-file
approach to CLUSTER and so on.

Hmm, if you replace relcache in the smgr interfaces with pointers to
an smgr-maintained data structure, that might be the same thing that
you are alluding to above about an smgr hash table.

One thing *not* to do is add yet a third layer of data structure on
top of the ones already maintained in fd.c and md.c. Whatever extra
data might be needed here should be added to md.c's tables, I think,
and then the tokens used in the smgr interface would be pointers into
that table.

regards, tom lane

#115Randall Parker
rgparker@west.net
In reply to: Thomas Lockhart (#67)
Re: Re: Big 7.1 open items

Thomas,

A few (hopefully relevant) comments regarding character sets, code pages,
I18N, and all that:

1) I've seen databases (DB2 if memory serves) that allowed the client
side to declare itself to the database back-end engine as being in a
particular code page. For instance, one could have a CP850 Latin-1 client
and an ISO 8859-1 database. The database engine did appropriate
translations in both directions.

2) Mixing code pages in a single column and then having the database
engine support it is not trivial. Each CHAR/VARCHAR would have to
have some code page settable per row (e.g. either as a separate column or
as something like mycolumnname.encoding).
Even if you could handle all that, you'd still be faced with the issue
of collating sequence. Each individual code page will have a collating
sequence. But how do you collate across code pages? There'd be letters
that exist only in a single code page. Plus, it gets messy with, for
instance, a simple umlauted a, which occurs in CP850, CP1252, and ISO
8859-1 (and likely in other code pages as well). That letter is really
the same letter in all those code pages and should be treated as such
when sorting.

3) I think it is more important for a database to support lots of
languages in the stored data than in the field names and table names. If
a programmer has to deal with A-Za-z for naming identifiers and that
person is Korean or Japanese, then that certainly is an imposition on
them. But it's a far, far bigger imposition if that programmer can't build
a database that will store the letters of his national language and sort
and index and search them in convenient ways.

4) The real solution to the multiple code page dilemma is Unicode.
Yes, it's more space. But the can of worms of dealing with multiple
code pages in a column is really no fun and the result is not great.
BTDTHTTS.

5) The problem with enforcing code page legality:
I've built a database in DB2 where particular columns in it contained
data from many different code pages (each row had a code page field as
well as a text field). For some applications that is okay if that field
is not going to be part of an index.
However, if a database is going to be defined as being in a particular
code page, and if the database engine is going to reject characters that
are not recognized as part of that code page then you can't play the sort
of game I just described _unless_ there is a different datatype that is
similar to CHAR/VARCHAR but for which the RDBMS does not enforce code
page legality on each character. Otherwise you choose some code page for
a column, you go merrily stuffing in all sorts of rows in all sorts of
code pages, and then along comes some character whose value is not a
valid character value in the code page that the RDBMS thinks the
column is in.
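
That failure mode can be demonstrated directly (again plain Python for
illustration; the code pages are the ones discussed above):

```python
# 'ü' is byte 0x81 in CP850, but 0x81 is unassigned in CP1252 --
# an engine enforcing CP1252 legality would reject this stored byte.
raw = 'ü'.encode("cp850")
print(raw)  # b'\x81'
try:
    raw.decode("cp1252")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```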

Anyway, I've done lots of I18N database stuff and hopefully a few of my
comments will be useful to the assembled brethren <g>.

In news:<3948E4D7.A3B722E9@alumni.caltech.edu>,
lockhart@alumni.caltech.edu says...

One issue: I can see (or imagine ;) how we can use the Postgres type
system to manage multiple character sets. But allowing arbitrary
character sets in, say, table names forces us to cope with allowing a
mix of character sets in a single column of a system table. afaik this
general capability is not mandated by SQL9x (the SQL_TEXT character set
is used for all system resources??). Would it be acceptable to have a
"default database character set" which is allowed to creep into the
pg_xxx tables? Even that seems to be a difficult thing to accomplish at
the moment (we'd need to get some of the text manipulation functions
from the catalogs, not from hardcoded references as we do now).

#116Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#113)
Re: Big 7.1 open items

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

It seems that we should also provide not_preallocated DATAFILE
when many_tables_in_a_file storage manager is introduced.

Several people in this thread have been talking like a
single-physical-file storage manager is in our future, but I can't
recall anyone saying that they were going to do such a thing or even
presenting reasons why it'd be a good idea.

Seems to me that physical file per relation is considerably better for
our purposes. It's easier to figure out what's going on for admin and
debug work, it means less lock contention among different backends
appending concurrently to different relations, and it gives the OS a
better shot at doing effective read-ahead on sequential scans.

So why all the enthusiasm for multi-tables-per-file?

regards, tom lane

#117Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#116)
Re: Big 7.1 open items

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

It seems that we should also provide not_preallocated DATAFILE
when many_tables_in_a_file storage manager is introduced.

Several people in this thread have been talking like a
single-physical-file storage manager is in our future, but I can't
recall anyone saying that they were going to do such a thing or even
presenting reasons why it'd be a good idea.

Seems to me that physical file per relation is considerably better for
our purposes. It's easier to figure out what's going on for admin and
debug work, it means less lock contention among different backends
appending concurrently to different relations, and it gives the OS a
better shot at doing effective read-ahead on sequential scans.

So why all the enthusiasm for multi-tables-per-file?

No idea. I thought Vadim mentioned it, but I am not sure anymore. I
certainly like our current system.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#118Chris Bitmead
chris@bitmead.com
In reply to: Bruce Momjian (#117)
Re: Big 7.1 open items

So why all the enthusiasm for multi-tables-per-file?

It allows you to use raw partitions, which stops the OS from double
buffering and wasting half of memory, as well as removing the overhead
of indirect blocks in the file system.

#119Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Bruce Momjian (#117)
RE: Big 7.1 open items

-----Original Message-----
From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]

So why all the enthusiasm for multi-tables-per-file?

No idea. I thought Vadim mentioned it, but I am not sure anymore. I
certainly like our current system.

Oops, I'm not so enthusiastic about a multi_tables_per_file smgr.
I believe that Ross and I have taken a practical way that doesn't
break current file_per_table smgr.

However it seems very natural to take multi_tables_per_file
smgr into account when we consider the TABLESPACE concept.
Because TABLESPACE is an encapsulation, it should have
a possibility to handle multi_tables_per_file smgr IMHO.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#120Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#119)
Re: Big 7.1 open items

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

However it seems very natural to take multi_tables_per_file
smgr into account when we consider the TABLESPACE concept.
Because TABLESPACE is an encapsulation, it should have
a possibility to handle multi_tables_per_file smgr IMHO.

OK, I see: you're just saying that the tablespace stuff should be
designed in such a way that it would work with a non-file-per-table
smgr. Agreed, that'd be a good check of a clean design, and someday
we might need it...

regards, tom lane

#121Kaare Rasmussen
kar@webline.dk
In reply to: Hiroshi Inoue (#113)
RE: Big 7.1 open items

Not to mention the reverse: if I read this right, you have to suck
up your GB's long in advance of actually needing them. That's OK
for a machine that's dedicated to Oracle ... not so OK for smaller
installations, playpens, etc.

To me it looks like a way to make Oracle work on VMS machines. This is the way
files are allocated on Digital hardware.


#122Randall Parker
rgparker@west.net
In reply to: Thomas Lockhart (#98)
Re: Big 7.1 open items

[This followup was posted to comp.databases.postgresql.hackers and a copy
was sent to the cited author.]

A few thoughts:

1) There may be reasons why someone might not want to use RAID.
For instance, suppose one wants to put different tables on different
drives so that the seeks for one table don't move the drive heads away
from the disk area for another table.
Also, suppose someone wants to use a particular drive for a particular
purpose (eg certain indexes) because it is faster at seeking vs another
drive that is faster at sustained transfer rates.
Also, someone may want to span a drive across multiple SCSI
controllers. Most RAID arrays I'm aware of are per SCSI controller.
I think it is fair to say that there will always be instances where
people want to have more control over where stuff goes because they are
willing to put the effort into more subtle tuning games. Well, there
ought to be a way.

2) Some OSs do not support symlinks. The ability to list a bunch of
devices for where things will go would be of value.
Also, if you aren't putting your data on a real file system (say on
raw partitions instead) you are going to need a way to specify that
anyway.

In news:<394A556A.4EAC8B9A@alumni.caltech.edu>,
lockhart@alumni.caltech.edu says...

o the ability to split single tables across disks was essential for
scalability when disks were small. But with RAID, NAS, etc etc isn't
that a smaller issue now?
o "tablespaces" would implement our less-developed "with location"
feature, right? Splitting databases, whole indices and whole tables
across storage is the biggest win for this work since more users will
use the feature.
o location information needs to travel with individual tables anyway.

#123Noname
JanWieck@t-online.de
In reply to: Tom Lane (#95)
Re: Big 7.1 open items

Tom Lane wrote:

JanWieck@t-online.de (Jan Wieck) writes:

There are also disadvantages.

You can run out of space even if there are plenty GB's
free on your disks. You have to create tablespaces
explicitly.

Not to mention the reverse: if I read this right, you have to suck
up your GB's long in advance of actually needing them. That's OK
for a machine that's dedicated to Oracle ... not so OK for smaller
installations, playpens, etc.

Right, the design is perfect for a few databases with a more
or less stable size and schema (slow to medium growth). The
problem is that production databases tend to fall into that
behaviour, and that might be a reason for so many people
asking for Oracle compatibility - they want to do development
in the highly flexible Postgres environment, while running
their production server under Oracle :-(.

I'm not convinced that there's anything fundamentally wrong with
doing storage allocation in Unix files the way we have been.

(At least not when we're sitting atop a well-done filesystem,
which may leave the Linux folk out in the cold ;-).)

I'm with you on that, even if I'm one of the Linux losers.
The only point that really strikes me is that in our system
you might end up with a corrupted file system because some
inode changes didn't make it to the disk before a crash. Even
if using fsync() instead of fdatasync() (which we cannot use
at all, and that's a pain from the performance PoV). In the
Oracle world, that could only happen during

ALTER TABLESPACE <tsname> ADD DATAFILE ...

which is a fairly rare command, usually issued by the DB
admin (at least it requires admin privileges) and thus
ensures the "admin is there and already paying attention". A
little detail not to underestimate IMHO.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#124Noname
JanWieck@t-online.de
In reply to: Thomas Lockhart (#96)
Re: Big 7.1 open items

Thomas Lockhart wrote:

(At least not when we're sitting atop a well-done filesystem,
which may leave the Linux folk out in the cold ;-).)

Those who live in HP houses should not throw stones :))

Huh? Up to HPUX-9 they used to have BSD-FFS - even if it was
a 4.2 BSD one - no?

Jan


#125Noname
JanWieck@t-online.de
In reply to: Tom Lane (#97)
Re: Big 7.1 open items

Tom Lane wrote:

JanWieck@t-online.de (Jan Wieck) writes:

Tom Lane wrote:

It gets a little trickier if you want to be able to split
multi-gig tables across several tablespaces, though, since
you couldn't just append ".N" to the base table path in that
scenario.

I'd be interested to know what sort of facilities Oracle
provides for managing huge tables...

Oracle tablespaces are a collection of 1...n preallocated
files. Each table then is bound to a tablespace and
allocates extents (chunks) from those files.

OK, to get back to the point here: so in Oracle, tables can't cross
tablespace boundaries, but a tablespace itself could span multiple
disks?

They can. The path in

ALTER TABLESPACE <tsname> ADD DATAFILE ...

can point to any location the db system has access to.

Not sure if I like that better or worse than equating a tablespace
with a directory (so, presumably, all the files within it live on
one filesystem) and then trying to make tables able to span
tablespaces. We will need to do one or the other though, if we want
to have any significant improvement over the current state of affairs
for large tables.

One way is to play the flip-the-path-ordering game some more,
and access multiple-segment tables with pathnames like this:

.../TABLESPACE/RELATION -- first or only segment
.../TABLESPACE/N/RELATION -- N'th extension segment

[...]

In most cases all objects in one database are bound to one or
two tablespaces (data and indices). So you do an estimation
of the size required, create the tablespaces (and probably
all their extension files), then create the schema and load
it. The only reason not to do so is if your DB exceeds some
size where you have to fear not being able to finish online
backups before getting stuck on the online redo log. That has
to do with the online backup procedure of Oracle.

This isn't any harder for md.c to deal with than what we do now,
but by making the /N subdirectories be symlinks, the dbadmin could
easily arrange for extension segments to go on different filesystems.
Also, since /N subdirectory symlinks can be added as needed,
expanding available space by attaching more disks isn't hard.
(If the admin hasn't pre-made a /N symlink when it's needed,
I'd envision the backend just automatically creating a plain
subdirectory so that it can extend the table.)
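
The quoted layout can be sketched roughly like so (a Python sketch with
invented OIDs; whether the path components are OIDs or relnames was still
being debated in this thread):

```python
# Sketch of the quoted .../TABLESPACE[/N]/RELATION layout.
# The OIDs below are invented for illustration.
def segment_path(base, tablespace, relation, segment=0):
    """First segment lives directly in the tablespace directory;
    extension segment N lives in an N/ subdirectory (or symlink)."""
    if segment == 0:
        return f"{base}/{tablespace}/{relation}"
    return f"{base}/{tablespace}/{segment}/{relation}"

print(segment_path("data", 16384, 24576))      # data/16384/24576
print(segment_path("data", 16384, 24576, 2))   # data/16384/2/24576
```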

So the admin always has to leave enough free space in the
default location to keep the DB running until he can take it
offline, move the autocreated files, and create the symlinks.
What a pain for 24/7 systems.

We'd still want to create some tools to help the dbadmin with slinging
all these symlinks around, of course. But I think it's critical to keep
the low-level file access protocol simple and reliable, which really
means minimizing the amount of information the backend needs to know to
figure out which file to write a page in. With something like the above
you only need to know the tablespace name (or more likely OID), the
relation OID (+name or not, depending on outcome of other argument),
and the offset in the table. No worse than now from the software's
point of view.

It's exactly the "low-level file access" protocol that is highly
complicated in Postgres. Because nearly every object needs its
own file, we have to deal with virtual file descriptors.
With an Oracle-like tablespace concept and a fixed limit of
total tablespace files (this time OS or installation
specific), we could keep them all open all the time. IMHO a
big win.

Jan


#126Noname
JanWieck@t-online.de
In reply to: Bruce Momjian (#99)
Re: Big 7.1 open items

Bruce Momjian wrote:

There are also disadvantages.

You can run out of space even if there are plenty GB's
free on your disks. You have to create tablespaces
explicitly.

If you've choosen inadequate extent size parameters, you
end up with high fragmented tables (slowing down) or get
stuck with running against maxextents, where only a reorg
(export/import) helps.

Also, Tom Lane pointed out to me that file system read-ahead does not
help if your table is spread around in tablespaces.

Not with our HEAP concept. With the Oracle EXTENT concept it
works pretty well, because they have different block/extent
sizes. Usually an extent spans multiple blocks, so in the
case of sequential reads they read each extent of probably
hundreds of KB sequentially. And in the case of indexed reads,
they know the extent and the offset of the tuple inside the
extent, so they know the exact location of the record inside
the tablespace to read.
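
As a toy model of that extent addressing (invented file names and sizes,
not Oracle's actual on-disk format):

```python
# Toy extent map: each extent is a (datafile, start, length) slice of a
# preallocated tablespace file; blocks inside an extent are contiguous.
extents = [("ts01.dbf",          0, 4096 * 100),
           ("ts01.dbf", 4096 * 500, 4096 * 200),
           ("ts02.dbf",          0, 4096 * 100)]

def locate(extent_no, offset):
    """Map (extent number, byte offset inside the extent) to an exact
    position in a tablespace file -- no filesystem lookup needed."""
    datafile, start, length = extents[extent_no]
    assert 0 <= offset < length
    return datafile, start + offset

print(locate(1, 4096 * 3))   # ('ts01.dbf', 2060288)
```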

The big problem we always had (why we need TOAST at all) is
that the logical blocksize (extent size) of a table is bound
to your physical blocksize used in the shared cache. This is
fixed so deeply in the heap storage architecture, that I'm
scared about it.

Jan


#127Noname
JanWieck@t-online.de
In reply to: Don Baccus (#105)
Re: Big 7.1 open items

Don Baccus wrote:

At 11:46 AM 6/16/00 -0400, Tom Lane wrote:

I personally dislike depending on symlinks to move stuff around.
Among other things, a pg_dump/restore (and presumably future
backup tools?) can't recreate the disk layout automatically.

Most impact from this one, IMHO.

Not that Oracle tools are able to do it either. But I think
it's more trivial to recreate a 30+ tablespace layout on the
disks than to recreate all symlinks for a 20,000+
tables/indices database like an SAP R/3 one.

Jan


#128Giles Lean
giles@nemeton.com.au
In reply to: Noname (#124)
Re: Big 7.1 open items

Thomas Lockhart wrote:

(At least not when we're sitting atop a well-done filesystem,
which may leave the Linux folk out in the cold ;-).)

Those who live in HP houses should not throw stones :))

Huh? Up to HPUX-9 they used to have BSD-FFS - even if it was
a 4.2 BSD one - no?

It's still there, along with VxFS from Veritas.

Ciao,

Giles

#129Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Noname (#123)
Re: Big 7.1 open items

OK, I have thought about tablespaces, and here is my proposal. Maybe
there will be some good ideas in my design.

My feeling is that intelligent use of directories and symlinks can allow
PostgreSQL to handle tablespaces and allow administrators to use
symlinks outside of PostgreSQL and have PostgreSQL honor those changes
in a reload.

Seems we have three tablespace needs:

locate database in separate disk
locate tables in separate directory/symlink
locate secondary extents on different drives

If we have a new CREATE DATABASE LOCATION command, we can say:

CREATE DATABASE LOCATION dbloc IN '/var/private/pgsql';
CREATE DATABASE newdb IN dbloc;

The first command makes sure /var/private/pgsql exists and is writable
by postgres. It then creates a dbloc directory and a symlink:

mkdir /var/private/pgsql/dbloc
ln -s /var/private/pgsql/dbloc data/base/dbloc

The CREATE DATABASE command creates data/base/dbloc/newdb and creates
the database there. We would have to store the dbloc location in
pg_database.

To handle placing tables, we can use:

CREATE LOCATION tabloc IN '/var/private/pgsql';
CREATE TABLE newtab ... IN tabloc;

The first command makes sure /var/private/pgsql exists and is writable
by postgres. It then creates a directory tabloc in /var/private/pgsql,
and does a symlink:

ln -s /var/private/pgsql/tabloc data/base/dbloc/newdb/tabloc

and creates the table in there. These location names have to be stored
in pg_class.

The difference between CREATE LOCATION and CREATE DATABASE LOCATION is
that the first one puts it in the current database, while the latter
puts the symlinks in data/base.

(Can we remove data/base and just make it data/?)

I would also allow a simpler CREATE LOCATION tabloc2 which just creates
a directory in the database directory. These can be moved later using
symlinks. Of course, CREATE DATABASE LOCATION too.

I haven't figured out extent locations yet. One idea is to allow
administrators to create symlinks for tables >1 gig, and to not remove
the symlinks when a table shrinks. Only remove the file pointed to by
the table, but leave the symlink there so if the table grows again, it
can use the symlink. lstat() would allow this.
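
The directory/symlink mechanics of this proposal could be sketched like
this (a Python demonstration in a scratch directory; CREATE LOCATION is a
proposed command, not an existing one, and the paths just mirror the
examples above):

```python
# Sketch of the proposed layout: the location's real directory lives
# outside the data tree, and data/base/.../tabloc is a symlink to it.
import os
import tempfile

root = tempfile.mkdtemp()
real = os.path.join(root, "var_private_pgsql", "tabloc")
dbdir = os.path.join(root, "data", "base", "dbloc", "newdb")
os.makedirs(real)
os.makedirs(dbdir)

# equivalent of: ln -s /var/private/pgsql/tabloc data/base/dbloc/newdb/tabloc
link = os.path.join(dbdir, "tabloc")
os.symlink(real, link)

# an lstat()-style check: a dump tool could emit CREATE LOCATION ... IN
# for symlinked entries, and plain CREATE LOCATION for real subdirectories
print(os.path.islink(link))                              # True
print(os.path.realpath(link) == os.path.realpath(real))  # True
```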

Now on to preserving this information. My idea is that PostgreSQL
should never remove a directory or symlink in the data/base directory.
Those represent locations made by the administrator. So, pg_dump with a
-l option can go through the db directory and output CREATE LOCATION
commands for every database, so when reloaded, the locations will be
preserved, assuming the symlinks point to still-valid directories.

What this does allow is someone to create locations during table
population, but to keep them all on the same drive. If they later move
things around on the disk using cp and symlinks, this will be preserved
by pg_dump.

My problem with many of the tablespace systems is that it requires two
changes. One in the file system using symlinks, and another in the
database to point to the new entries, or it does not preserve them
across backups.

If someone does want to remove a location, they would have to remove all
tables in the directory, and the base directory and symlink can be
removed with DROP LOCATION.

My solution basically stores locations for databases and tables in the
database, but does _not_ store information about what locations exist or
if they are symlinks. However, it does allow for preserving of this
information in dumps.

I feel this solution is very flexible.

Comments?

#130Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#124)
Re: Big 7.1 open items

JanWieck@t-online.de (Jan Wieck) writes:

Thomas Lockhart wrote:

Those who live in HP houses should not throw stones :))

Huh? Up to HPUX-9 they used to have BSD-FFS - even if it was
a 4.2 BSD one - no?

Yeah, the standard HPUX filesystem is still BSD ... and it still runs
rings around Linux extfs2 in my experience. (I've been informed that
Linux has better filesystems than extfs2, but that seems to be what
the average Linux user is running.) I have a realtime data collection
program that usually wants to write several thousand small files during
shutdown. The shutdown typically takes about 3 minutes on an HP 715/75,
upwards of 10 minutes on a Linux box with nominally-faster hardware.

BTW, HP is trying to sell people on using a new journaling filesystem
that they claim outperforms BSD, but my few experiments with it
haven't encouraged me to pursue it.

regards, tom lane

#131Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Thomas Lockhart (#94)
Re: Re: Big 7.1 open items

Oh. I was recalling SQL_TEXT as being a "subset" character set which
contains only the characters (more or less) that are required for
implementing the SQL92 query language and standard features.

Are you seeing it as being a "superset" character set which can
represent all other character sets??

Yes, it's my understanding from the section 19.3.1 of Date's book
(fourth edition). Please correct me if I am wrong.

And, how would you suggest we start tracking this discussion in a design
document? I could put something into the developer's guide, or we could
have a plain-text FAQ, or ??

I'd propose that we start accumulating a feature list, perhaps ordering
it into categories like

o required/suggested by SQL9x
o required/suggested by experience in the real world
o sure would be nice to have
o really bad idea ;)

Sounds good. Could I put "CREATE CHARACTER SET" as the first item of
the list and start a discussion for that?

I have a feeling that you have an idea to treat a user-defined charset
as a new PostgreSQL data type. So probably "CREATE CHARACTER SET"
could be translated to our "CREATE TYPE" by the parser, right?
--
Tatsuo Ishii

#132Michael Reifenberger
root@nihil.plaut.de
In reply to: Noname (#123)
Re: Big 7.1 open items

On Sun, 18 Jun 2000, Jan Wieck wrote:
...

ALTER TABLESPACE <tsname> ADD DATAFILE ...

which is a fairly seldom command, issued usually by the DB
admin (at least it requires admin privileges) and thus
ensures the "admin is there and already paying attention". A
little detail not to underestimate IMHO.

...
Esp. in the R/3 area this will no longer be true as commands like
"AUTOEXTEND" and "RESIZE" become more commonly used (automated at worst).

Bye!
----
Michael Reifenberger
^.*Plaut.*$, IT, R/3 Basis, GPS

#133Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Bruce Momjian (#129)
Re: Big 7.1 open items

I haven't figured out extent locations yet. One idea is to allow
administrators to create symlinks for tables >1 gig, and to not remove
the symlinks when a table shrinks. Only remove the file pointed to by
the table, but leave the symlink there so if the table grows again, it
can use the symlink. lstat() would allow this.

OK, I have an extent idea. It is:

CREATE LOCATION tabloc IN '/var/private/pgsql' EXTENT2
'/usr/pg'.

This creates an /extents directory in the location, with extents/2
symlinked to /usr/pg:

data/base/mydb/tabloc
data/base/mydb/tabloc/extents/2

When extending a table, it looks for an extents/2 directory and uses
that if it exists. Same for extents/3. We could even get fancy and
round-robin through all the extents directories, looping around to the
beginning when we run out of them. That sounds nice.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#134Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Bruce Momjian (#133)
Re: Big 7.1 open items

I haven't figured out extent locations yet. One idea is to allow
administrators to create symlinks for tables >1 gig, and to not remove
the symlinks when a table shrinks. Only remove the file pointed to by
the table, but leave the symlink there so if the table grows again, it
can use the symlink. lstat() would allow this.

OK, I have an extent idea. It is:

CREATE LOCATION tabloc IN '/var/private/pgsql' EXTENT2
'/usr/pg'.

Even better:

CREATE LOCATION tabloc IN '/var/private/pgsql'
EXTENT '/usr/pg', '/usr1/pg'

This will create extent/2 and extent/3, and the system can rotate
extents between the primary storage area, and 2 and 3.

Also, CREATE INDEX will need a location specification added.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#135Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#133)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

... We could even get fancy and
round-robin through all the extents directories, looping around to the
beginning when we run out of them. That sounds nice.

That sounds horrible. There's no way to tell which extent directory
extent N goes into except by scanning the location directory to find
out how many extent subdirectories there are (so that you can compute
N modulo number-of-directories). Do you want to pay that price on every
file open?

Worse, what happens when you add another extent directory? You can't
find your old extents anymore, that's what, because they're not in the
right place (N modulo number-of-directories just changed). Since the
extents are presumably on different volumes, you're talking about
physical file moves to get them where they should be. You probably
can't add a new extent without shutting down the entire database while
you reshuffle files --- at the very least you'd need to get exclusive
locks on all the tables in that tablespace.

Also, you'll get filename conflicts from multiple extents of a single
table appearing in one of the recycled extent dirs. You could work
around it by using the non-modulo'd N as part of the final file name,
but that just adds more complexity and makes the filename-generation
machinery that much more closely tied to this specific way of doing
things.

The right way to do this is that extent N goes into extents subdirectory
N, period. If there's no such subdirectory, create one on-the-fly as a
plain subdirectory of the location directory. The dbadmin can easily
create secondary extent symlinks *in advance of their being needed*.
Reorganizing later is much more painful since it requires moving
physical files, but I think that'd be true no matter what. At least
we should see to it that adding more space in advance of needing it is
painless.

It's possible to do it that way (auto-create extent subdir if needed)
without tying the md.c machinery real closely to a specific filename
creation procedure: it's just the same sort of thing as install programs
customarily do. "If you fail to create a file, try creating its
ancestor directory." We'd have to think about whether it'd be a good
idea to allow auto-creation of more than one level of directory; offhand
it seems that needing to make more than one level is probably a sign of
an erroneous path, not need for another extent subdirectory.

regards, tom lane

#136Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#135)
Re: Big 7.1 open items

If we eliminate the round-robin idea, what did people think of the rest
of the ideas?

Bruce Momjian <pgman@candle.pha.pa.us> writes:

... We could even get fancy and
round-robin through all the extents directories, looping around to the
beginning when we run out of them. That sounds nice.

That sounds horrible. There's no way to tell which extent directory
extent N goes into except by scanning the location directory to find
out how many extent subdirectories there are (so that you can compute
N modulo number-of-directories). Do you want to pay that price on every
file open?

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#137Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Kaare Rasmussen (#121)
Re: Big 7.1 open items

Not to mention the reverse: if I read this right, you have to suck
up your GB's long in advance of actually needing them. That's OK
for a machine that's dedicated to Oracle ... not so OK for smaller
installations, playpens, etc.

To me it looks like a way to make Oracle work on VMS machines. This is the way
files are allocated on Digital hardware.

Agreed.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#138Don Baccus
dhogaza@pacifier.com
In reply to: Bruce Momjian (#136)
Re: Big 7.1 open items

At 06:50 PM 6/18/00 -0400, Bruce Momjian wrote:

If we eliminate the round-robin idea, what did people think of the rest
of the ideas?

Why invent new syntax when "create tablespace" is something a lot
of folks will recognize?

And why not use "create table ... using ... "? In other words,
Oracle-compatible for this construct? Sure, Postgres doesn't
have to follow Oraclisms, but picking an existing construct means
at least SOME folks can import a datamodel without having to
edit it.

Does your proposal break the smgr abstraction, i.e. does it
preclude later efforts to (say) implement an (optional)
raw-device storage manager?

- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.

#139Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Don Baccus (#138)
Re: Big 7.1 open items

At 06:50 PM 6/18/00 -0400, Bruce Momjian wrote:

If we eliminate the round-robin idea, what did people think of the rest
of the ideas?

Why invent new syntax when "create tablespace" is something a lot
of folks will recognize?

And why not use "create table ... using ... "? In other words,
Oracle-compatible for this construct? Sure, Postgres doesn't
have to follow Oraclisms, but picking an existing construct means
at least SOME folks can import a datamodel without having to
edit it.

Sure, use another syntax. My idea was to use symlinks, to allow moving
things around via symlinks, and to preserve them during dump.

Does your proposal break the smgr abstraction, i.e. does it
preclude later efforts to (say) implement an (optional)
raw-device storage manager?

Seeing very few want that done, I don't see it as an issue at this
point.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#140Don Baccus
dhogaza@pacifier.com
In reply to: Bruce Momjian (#139)
Re: Big 7.1 open items

At 08:08 PM 6/18/00 -0400, Bruce Momjian wrote:

Does your proposal break the smgr abstraction, i.e. does it
preclude later efforts to (say) implement an (optional)
raw-device storage manager?

Seeing very few want that done, I don't see it as an issue at this
point.

Sorry, I disagree. There's no excuse for breaking existing abstractions
unless there's a compelling reason to do so.

My question should make it clear I was using a raw-device storage
manager as an example. There are other possibilities, like a
many-tables-per-file storage manager.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.

#141Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#130)
Re: Big 7.1 open items

JanWieck@t-online.de (Jan Wieck) writes:

Thomas Lockhart wrote:

Those who live in HP houses should not throw stones :))

Huh? Up to HPUX-9 they used to have BSD-FFS - even if it was
a 4.2 BSD one - no?

Yeah, the standard HPUX filesystem is still BSD ... and it still runs
rings around Linux extfs2 in my experience. (I've been informed that
Linux has better filesystems than extfs2, but that seems to be what
the average Linux user is running.) I have a realtime data collection
program that usually wants to write several thousand small files during
shutdown. The shutdown typically takes about 3 minutes on an HP 715/75,
upwards of 10 minutes on a Linux box with nominally-faster hardware.

BTW, HP is trying to sell people on using a new journaling filesystem
that they claim outperforms BSD, but my few experiments with it
haven't encouraged me to pursue it.

You should really try the BSD4.4 FFS with soft updates. It re-orders
disk flushes to greatly improve performance. It really is great.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#142Ross J. Reedstrom
reedstrm@rice.edu
In reply to: Don Baccus (#140)
Re: Big 7.1 open items

On Sun, Jun 18, 2000 at 05:12:22PM -0700, Don Baccus wrote:

At 08:08 PM 6/18/00 -0400, Bruce Momjian wrote:

Does your proposal break the smgr abstraction, i.e. does it
preclude later efforts to (say) implement an (optional)
raw-device storage manager?

Seeing very few want that done, I don't see it as an issue at this
point.

Sorry, I disagree. There's no excuse for breaking existing abstractions
unless there's a compelling reason to do so.

My question should make it clear I was using a raw-device storage
manager as an example. There are other possibilities, like a
many-tables-per-file storage manager.

Don, I see Bruce's proposal as implementation details within the storage
manager. In fact, we should probably implement the tablespace commands
with an extension of the smgr api. One different smgr I've been thinking
a little about is a persistent-RAM smgr: I've heard there are some
new technologies coming up that may make large amounts of it cheaper, soon.
And there's always PostgreSQL for PalmOS, right? (Hey, IBM's got a Pocket
DB2, why shouldn't we?)

Ross
--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005

#143Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Don Baccus (#140)
Re: Big 7.1 open items

At 08:08 PM 6/18/00 -0400, Bruce Momjian wrote:

Does your proposal break the smgr abstraction, i.e. does it
preclude later efforts to (say) implement an (optional)
raw-device storage manager?

Seeing very few want that done, I don't see it as an issue at this
point.

Sorry, I disagree. There's no excuse for breaking existing abstractions
unless there's a compelling reason to do so.

My question should make it clear I was using a raw-device storage
manager as an example. There are other possibilities, like a
many-tables-per-file storage manager.

I agree it is nice to keep things as abstract as possible. I just don't
know if the abstraction will cause added complexity.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#144Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#135)
Re: Big 7.1 open items

My basic proposal is that we optionally allow symlinks when creating
tablespace directories, and that we interrogate those symlinks during a
dump so administrators can move tablespaces around without having to
modify environment variables or system tables.

I also suggested creating an extent directory to hold extents, like
extent/2 and extent/3. This will allow administration for smaller sites
to be simpler.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#145Don Baccus
dhogaza@pacifier.com
In reply to: Bruce Momjian (#144)
Re: Big 7.1 open items

At 11:13 PM 6/18/00 -0400, Bruce Momjian wrote:

My basic proposal is that we optionally allow symlinks when creating
tablespace directories, and that we interrogate those symlinks during a
dump so administrators can move tablespaces around without having to
modify environment variables or system tables.

If they can move them around from within the db, they'll have no need to
move them around from outside the db.

I don't quite understand your devotion to using filesystem commands
outside the database to do database administration.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.

#146Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#144)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I also suggested creating an extent directory to hold extents, like
extent/2 and extent/3. This will allow administration for smaller sites
to be simpler.

I don't see the value in creating an extra level of directory --- seems
that just adds one more Unix directory-lookup cycle to each file open,
without any apparent return. What's wrong with extent directory names
like extent2, extent3, etc?

Obviously the extent dirnames must be chosen so they can't conflict
with table filenames, but that's easily done. For example, if table
files are named like 'OID_xxx' then 'extentN' will never conflict.

regards, tom lane

#147Tom Lane
tgl@sss.pgh.pa.us
In reply to: Don Baccus (#145)
Re: Big 7.1 open items

Don Baccus <dhogaza@pacifier.com> writes:

If they can move them around from within the db, they'll have no need to
move them around from outside the db.
I don't quite understand your devotion to using filesystem commands
outside the database to do database administration.

Being *able* to use filesystem commands to see/fix what's going on is a
good thing, particularly from a development/debugging standpoint. But
I agree we want to have within-the-system admin commands to do the same
things.

regards, tom lane

#148Don Baccus
dhogaza@pacifier.com
In reply to: Tom Lane (#147)
Re: Big 7.1 open items

At 12:28 AM 6/19/00 -0400, Tom Lane wrote:

Being *able* to use filesystem commands to see/fix what's going on is a
good thing, particularly from a development/debugging standpoint.

Of course it's a crutch for development, but outside of development
circles few users will know how to use the OS in regard to the
database.

Assuming PG takes off. Of course, if it remains the realm of the
dedicated hard-core hacker, I'm wrong.

I have nothing against preserving the ability to use filesystem
commands if there's no significant cost inherent in this approach.
I'd view the breaking of the smgr abstraction as a significant cost (though
I agree with Ross that Bruce's proposal shouldn't require that; I
asked my question to flush Bruce out, if you will, because he's
devoted to a particular outside-the-db management model).

But
I agree we want to have within-the-system admin commands to do the same
things.

MUST have, I should think.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.

#149Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#147)
Re: Big 7.1 open items

Don Baccus <dhogaza@pacifier.com> writes:

If they can move them around from within the db, they'll have no need to
move them around from outside the db.
I don't quite understand your devotion to using filesystem commands
outside the database to do database administration.

Being *able* to use filesystem commands to see/fix what's going on is a
good thing, particularly from a development/debugging standpoint. But
I agree we want to have within-the-system admin commands to do the same
things.

Yes, I like to have db commands to do it. I just like to allow things
outside too, if possible. It also prevents things from getting out of
sync because the database doesn't need to store the symlink location.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#150Tom Lane
tgl@sss.pgh.pa.us
In reply to: Don Baccus (#148)
Re: Big 7.1 open items

Don Baccus <dhogaza@pacifier.com> writes:

I'd view the breaking of smgr abstraction as a significant cost

Actually, the "smgr abstraction" has *been* broken for a long time,
due to sloppy implementation of features like relation rename.
But I agree we should try to re-establish a clean separation.

But
I agree we want to have within-the-system admin commands to do the same
things.

MUST have, I should think.

No argument from this quarter. It seems to me that once a PG
installation has been set up, it ought to be possible to do routine
admin tasks remotely --- and that means no direct access to the
server's filesystem.

regards, tom lane

#151Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Tom Lane (#150)
AW: Big 7.1 open items

BTW, schemas do make things interesting for the other camp:
is it possible for the same table to be referenced by different
names in different schemas? If so, just how useful is it to pick
one of those names arbitrarily for the filename? This is an advanced
version of the main objection to using the original relname and not
updating it at RENAME TABLE --- sooner or later, the filenames are
going to be more confusing than helpful.

Comments? Have I missed something important about schemas?

I think we have to agree on the way we want schemas to be.
Imho (and in other db's) the schema is simply the owner of a table.

The owner is an optional qualifier of the table name ( select * from
"owner".tabname ).
It also implies that different owners can have a table with the same name
in the same database. (this is only implemented in some other db systems)

Our database concept imho should not be altered; thus we keep the
hierarchy dbname --> owner(=schema) --> tablename.

Andreas

#152Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Zeugswetter Andreas SB (#151)
AW: Big 7.1 open items

OK, to get back to the point here: so in Oracle, tables can't cross
tablespace boundaries,

This is only true if you don't insert more coins and buy the Partitioning
Option,
or you use those coins to switch to Informix.

but a tablespace itself could span multiple
disks?

Yes

Not sure if I like that better or worse than equating a tablespace
with a directory (so, presumably, all the files within it live on
one filesystem) and then trying to make tables able to span
tablespaces. We will need to do one or the other though, if we want
to have any significant improvement over the current state of affairs
for large tables.

You can currently use a union all view and write appropriate rules
for insert, update and delete in PostgreSQL. The only disadvantage is
that partitions (fragments, table parts) cannot be optimized away,
but we could fix that by making the optimizer take check constraints
into account (e.g. check (year = 2000) against select * where year = 1999).

Andreas

#153Zeugswetter Andreas SB
ZeugswetterA@Wien.Spardat.at
In reply to: Zeugswetter Andreas SB (#152)
AW: Big 7.1 open items

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Please add my opinion for naming rule.

relname/unique_id, but needs some work (a new pg_class column,
no relname change) for unique-id generation

filename not relname

Why is a unique ID better than --- or even different from ---
using the relation's OID? It seems pointless to me...

just to open up a whole new bucket of worms here, but ... if we do use
OID (which up until this thought I endorse 100%) ... do we not run a
risk if we run out of OIDs? As far as I know, those are still a finite
resource, no?

or, do we just assume that by the time that comes, everyone will be
pretty much using 64bit machines? :)

I think the idea is to have an option to remove oid's from user tables.
I don't think you will run out of oid's if your bulk data doesn't use
up oid's.

Andreas

#154Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#146)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I also suggested creating an extent directory to hold extents, like
extent/2 and extent/3. This will allow administration for smaller sites
to be simpler.

I don't see the value in creating an extra level of directory --- seems
that just adds one more Unix directory-lookup cycle to each file open,
without any apparent return. What's wrong with extent directory names
like extent2, extent3, etc?

Obviously the extent dirnames must be chosen so they can't conflict
with table filenames, but that's easily done. For example, if table
files are named like 'OID_xxx' then 'extentN' will never conflict.

We could call them extent.2, extent-2, or Extent-2.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#155Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Don Baccus (#148)
Re: Big 7.1 open items

At 12:28 AM 6/19/00 -0400, Tom Lane wrote:

Being *able* to use filesystem commands to see/fix what's going on is a
good thing, particularly from a development/debugging standpoint.

Of course it's a crutch for development, but outside of development
circles few users will know how to use the OS in regard to the
database.

Assuming PG takes off. Of course, if it remains the realm of the
dedicated hard-core hacker, I'm wrong.

I have nothing against preserving the ability to use filesystem
commands if there's no significant cost inherent in this approach.
I'd view the breaking of the smgr abstraction as a significant cost (though
I agree with Ross that Bruce's proposal shouldn't require that; I
asked my question to flush Bruce out, if you will, because he's
devoted to a particular outside-the-db management model).

The fact is that symlink information is already stored in the file
system. If we store symlink information in the database too, there
exists the ability for the two to get out of sync. My point is that I
think we can _not_ store symlink information in the database, and query
the file system using lstat when required.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#156Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Bruce Momjian (#155)
RE: Big 7.1 open items

-----Original Message-----
From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]

The fact is that symlink information is already stored in the file
system. If we store symlink information in the database too, there
exists the ability for the two to get out of sync. My point is that I
think we can _not_ store symlink information in the database, and query
the file system using lstat when required.

Hmm, this seems pretty confusing to me.
I don't understand the necessity of symlinks.
Directory trees, symlinks, hard links ... are OS standards.
But I don't think they are fit for dbms management.

PostgreSQL is a database system, of course. So
couldn't it handle a more flexible structure than the OS's
directory tree for itself ?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#157Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Hiroshi Inoue (#156)
Re: Big 7.1 open items

-----Original Message-----
From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]

The fact is that symlink information is already stored in the file
system. If we store symlink information in the database too, there
exists the ability for the two to get out of sync. My point is that I
think we can _not_ store symlink information in the database, and query
the file system using lstat when required.

Hmm, this seems pretty confusing to me.
I don't understand the necessity of symlinks.
Directory trees, symlinks, hard links ... are OS standards.
But I don't think they are fit for dbms management.

PostgreSQL is a database system, of course. So
couldn't it handle a more flexible structure than the OS's
directory tree for itself ?

Yes, but is anyone suggesting a solution that does not work with
symlinks? If not, why not do it that way?

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#158Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Bruce Momjian (#157)
RE: Big 7.1 open items

-----Original Message-----
From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]

-----Original Message-----
From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]

The fact is that symlink information is already stored in the file
system. If we store symlink information in the database too, there
exists the ability for the two to get out of sync. My point is that I
think we can _not_ store symlink information in the database, and query
the file system using lstat when required.

Hmm, this seems pretty confusing to me.
I don't understand the necessity of symlinks.
Directory trees, symlinks, hard links ... are OS standards.
But I don't think they are fit for dbms management.

PostgreSQL is a database system, of course. So
couldn't it handle a more flexible structure than the OS's
directory tree for itself ?

Yes, but is anyone suggesting a solution that does not work with
symlinks? If not, why not do it that way?

Maybe other solutions have been proposed already because
there have been so many opinions and proposals.

I've felt the TABLE(DATA)SPACE discussion has always been
divergent. IMHO, one of the main causes is that various factors
have been discussed at once. Shouldn't we build step-by-step
consensus in the TABLE(DATA)SPACE discussion ?

IMHO, the first step is to decide the syntax of the CREATE TABLE
command, not to define TABLE(DATA)SPACE.

Comments ?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#159Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Hiroshi Inoue (#158)
Re: Big 7.1 open items

Yes, but is anyone suggesting a solution that does not work with
symlinks? If not, why not do it that way?

Maybe other solutions have been proposed already because
there have been so many opinions and proposals.

I've felt the TABLE(DATA)SPACE discussion has always been
divergent. IMHO, one of the main causes is that various factors
have been discussed at once. Shouldn't we build step-by-step
consensus in the TABLE(DATA)SPACE discussion ?

IMHO, the first step is to decide the syntax of the CREATE TABLE
command, not to define TABLE(DATA)SPACE.

Comments ?

Agreed. Seems we have several issues:

filename contents
tablespace implementation
tablespace directory layout
tablespace commands and syntax

Filename syntax seems to have resolved to
tablespace/tablename_oid_version or something like that. I think a
clean solution to keep symlink names in sync with rename is to use hard
links during rename, and during vacuum, if the link count is greater
than one, we can scan the directory and remove old files matching the
oid.
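The hard-link rename idea can be sketched in a few lines of shell. The filenames below are hypothetical, just following the tablename_oid_version pattern under discussion; this is a sketch of the trick, not actual PostgreSQL behavior:

```shell
# Sketch of the hard-link rename idea (hypothetical filenames following
# the tablename_oid_version pattern; not actual PostgreSQL behavior).
set -e
rm -rf /tmp/hl_demo && mkdir -p /tmp/hl_demo
cd /tmp/hl_demo

echo "table data" > mytab_16384_1     # existing table file
ln mytab_16384_1 newtab_16384_1       # "rename": add a hard link

# Both names now reach the same inode; link count (2nd ls -l field) is 2.
links=$(ls -l newtab_16384_1 | awk '{print $2}')

# A later vacuum-style pass sees link count > 1 and removes the stale
# file that still matches the old name for this OID.
if [ "$links" -gt 1 ]; then
    rm mytab_16384_1
fi
```

After the cleanup pass only the new name remains, and a backend that still had the old file open would be unaffected, since the inode survives the unlink.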

I hope we can implement tablespaces using symlinks that can be dumped, but
the symlink location does not have to be stored in the database.

Seems we are going to use Extent-2/Extent-3 to store extents under each
tablespace.

It also seems we will be using the Oracle tablespace syntax where
appropriate.

Comments?
-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#160Philip J. Warner
pjw@rhyme.com.au
In reply to: Bruce Momjian (#159)
Re: Big 7.1 open items

At 09:40 20/06/00 -0400, Bruce Momjian wrote:

[lots of stuff about symlinks]

It just occurred to me that the symlinks concerns may be short-circuitable,
if the following are true:

1. most of the desirability is for external 'management' and debugging etc
on 'reasonably' static database designs.

2. metadata changes (specifically renaming tables) occur infrequently.

3. there is no reason why they are desirable *technically* within the
implementations being discussed.

If these are true, then why not create a utility (eg. pg_update_symlinks)
that creates the relevant symlinks. It does not matter if they are
outdated, from an integrity point of view, and for the most part they can
be automatically maintained. Internally, postgresql can totally ignore them.

Have I missed something?
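A pg_update_symlinks of this sort might be little more than a loop over a name-to-OID mapping. Everything below (paths, the mapping format, the OID) is assumed purely for illustration:

```shell
# Hypothetical sketch of a pg_update_symlinks-style utility: rebuild a
# directory of human-readable symlinks from a "tablename oid" mapping,
# while the server itself only ever touches the OID-named files.
set -e
rm -rf /tmp/sl_demo && mkdir -p /tmp/sl_demo/base
cd /tmp/sl_demo

echo "data" > base/16384                 # real file, named by OID
printf 'customers 16384\n' > mapping     # name -> OID map (assumed format)

# Rebuild the link directory from scratch, so stale links never survive.
rm -rf by_name && mkdir by_name
while read name oid; do
    ln -s ../base/"$oid" by_name/"$name"
done < mapping
```

Since the server never follows these links, it does not matter, from an integrity point of view, if they are briefly outdated between runs.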

----------------------------------------------------------------
Philip Warner
Albatross Consulting Pty. Ltd. (A.C.N. 008 659 498)
Tel: (+61) 0500 83 82 81
Fax: (+61) 0500 83 82 82
Http://www.rhyme.com.au
PGP key available upon request, and from pgp5.ai.mit.edu:11371

#161Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Philip J. Warner (#160)
Re: Big 7.1 open items

At 09:40 20/06/00 -0400, Bruce Momjian wrote:

[lots of stuff about symlinks]

It just occurred to me that the symlinks concerns may be short-circuitable,
if the following are true:

1. most of the desirability is for external 'management' and debugging etc
on 'reasonably' static database designs.

2. metadata changes (specifically renaming tables) occur infrequently.

3. there is no reason why they are desirable *technically* within the
implementations being discussed.

If these are true, then why not create a utility (eg. pg_update_symlinks)
that creates the relevant symlinks. It does not matter if they are
outdated, from an integrity point of view, and for the most part they can
be automatically maintained. Internally, postgresql can totally ignore them.

Have I missed something?

I am a little confused. Are you suggesting that the entire symlink
thing can be done outside the database? Yes, that is true if we don't
store the symlink locations in the database. Of course, the database
has to be down to do this.

#162Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#159)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Agreed. Seems we have several issues:

filename contents
tablespace implementation
tablespace directory layout
tablespace commands and syntax

I think we've agreed that the filename must depend on tablespace,
file version, and file segment number in some fashion --- plus
the table name/OID of course. Although there's no real consensus
about exactly how to construct the name, agreeing on the components
is still a positive step.

A couple of other areas of contention were:

revising smgr interface to be cleaner
exactly what to store in pg_class

I don't think there's any quibble about the idea of cleaning up smgr,
but we don't have a complete proposal on the table yet either.

As for the pg_class issue, I still favor storing
(a) OID of tablespace --- not for file access, but so that
associated tablespace-table entry can be looked up
by tablespace management operations
(b) pathname of file as a column of type "name", including
a %d to be replaced by segment #

I think Peter was holding out for storing purely numeric tablespace OID
and table version in pg_class and having a hardwired mapping to pathname
somewhere in smgr. However, I think that doing it that way gains only
micro-efficiency compared to passing a "name" around, while using the
name approach buys us flexibility that's needed for at least some of
the variants under discussion. Given that the exact filename contents
are still so contentious, I think it'd be a bad idea to pick an
implementation that doesn't allow some leeway as to what the filename
will be. A name also has the advantage that it is a single item that
can be used to identify the table to smgr, which will help in cleaning
up the smgr interface.

As for tablespace layout/implementation, the only real proposal I've
heard is that there be a subdirectory of the database directory for each
tablespace, and that that have a subdirectory for each segment (extent)
of its tables --- where any of these subdirectories could be symlinks
off to a different filesystem. Some unhappiness was raised about
depending on symlinks for this function, but I didn't hear one single
concrete reason not to do it, nor an alternative design. Unless someone
comes up with a counterproposal, I think that that's what the actual
access mechanism will look like. We still need to talk about what we
want to store in the SQL-level representation of a tablespace, and what
sort of tablespace management tools/commands are needed. (Although
"try to make it look like Oracle" seems to be pretty much the consensus
for the command level, not all of us know exactly what that means...)
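The access mechanism described above can be mocked up in a few lines of shell; all paths and names here are hypothetical:

```shell
# Mock-up of the proposed layout: a subdirectory of the database
# directory per tablespace, any of which may be a symlink onto a
# different filesystem. All paths here are hypothetical.
set -e
rm -rf /tmp/ts_demo /tmp/otherfs
mkdir -p /tmp/ts_demo/base/mydb /tmp/otherfs
cd /tmp/ts_demo/base/mydb

mkdir default_ts                 # ordinary tablespace directory
ln -s /tmp/otherfs fast_ts       # tablespace living on another filesystem

# A backend whose working directory is the database directory addresses
# a table by the relative path tablespace/filename; the symlink is
# completely transparent to it.
echo "rows" > fast_ts/mytab_16384_1
```

The file actually lands in /tmp/otherfs, while the backend's view of the relative path never changes.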

Comments? Anything else that we do have consensus on?

regards, tom lane

#163Tom Lane
tgl@sss.pgh.pa.us
In reply to: Philip J. Warner (#160)
Re: Big 7.1 open items

"Philip J. Warner" <pjw@rhyme.com.au> writes:

If these are true, then why not create a utility (eg. pg_update_symlinks)
that creates the relevant symlinks. It does not matter if they are
outdated, from an integrity point of view, and for the most part they can
be automatically maintained. Internally, postgresql can totally ignore them.

What?

I think you are confusing a couple of different things. IIRC, at one
time when we were just thinking about ALTER TABLE RENAME, there was
a suggestion that the "real" table files be named by table OID, and
that there be symlinks to those files named by logical table name as
a crutch (:-)) for admins who wanted to know which table file was which.
That could be handled as you've sketched above, but I think the whole
proposal has fallen by the wayside anyway.

The current discussion of symlinks is focusing on using directory
symlinks, not file symlinks, to represent/implement tablespace layout.

regards, tom lane

#164Philip J. Warner
pjw@rhyme.com.au
In reply to: Bruce Momjian (#161)
Re: Big 7.1 open items

At 10:35 20/06/00 -0400, Bruce Momjian wrote:

If these are true, then why not create a utility (eg. pg_update_symlinks)
that creates the relevant symlinks. It does not matter if they are
outdated, from an integrity point of view, and for the most part they can
be automatically maintained. Internally, postgresql can totally ignore

them.

I am a little confused. Are you suggesting that the entire symlink
thing can be done outside the database? Yes, that is true if we don't
store the symlink locations in the database. Of course, the database
has to be down to do this.

The idea was to have postgresql, internally, totally ignore symlinks - use
OID or whatever is technically best for file names. Then create a
utility/command to make human-centric symlinks in a known location. The
symlinks *could* be updated automatically by postgres, if possible, but
would never be used internally. Things like vacuum could report out of date
symlinks, and maybe fix them (but probably not).

It may sound crude, but the only reason for the symlinks is for humans to
'see what is going on', and in most cases they won't be very volatile.


#165Philip J. Warner
pjw@rhyme.com.au
In reply to: Tom Lane (#163)
Re: Big 7.1 open items

At 10:45 20/06/00 -0400, Tom Lane wrote:

What?

...

The current discussion of symlinks is focusing on using directory
symlinks, not file symlinks, to represent/implement tablespace layout.

Ooops. I'll pull my head in again.


#166Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Thomas Lockhart (#67)
Re: Re: Big 7.1 open items

Oh. I was recalling SQL_TEXT as being a "subset" character set which
contains only the characters (more or less) that are required for
implementing the SQL92 query language and standard features.
Are you seeing it as being a "superset" character set which can
represent all other character sets??

Yes, it's my understanding from the section 19.3.1 of Date's book
(fourth edition). Please correct me if I am wrong.

Yuck. That is what it says, all right :(

Date says that SQL_TEXT is required to have two things:
1) all characters used in the SQL language itself (which is what I
recalled)

2) Every other character from every character set in the installation.

afaict (2) pretty much kills extensibility if we interpret that
literally. I'd like to research it a bit more before we accept it as a
requirement.

I'd propose that we start accumulating a feature list, perhaps ordering
it into categories like

o required/suggested by SQL9x
o required/suggested by experience in the real world
o sure would be nice to have
o really bad idea ;)

Sounds good. Could I put "CREATE CHARACTER SET" as the first item of
the list and start a discussion for that?

I have a feeling that you have an idea to treat a user-defined charset
as a new PostgreSQL data type. So probably "CREATE CHARACTER SET"
could be translated to our "CREATE TYPE" by the parser, right?

Yes. Though the SQL_TEXT issue may completely kill this. And lead to a
requirement that we have a full-unicode backend :((

I'm hoping that there is a less-intrusive way to do this. What do other
database systems have for this? I assume most do not have much...

- Thomas

#167Peter Eisentraut
peter_e@gmx.net
In reply to: Bruce Momjian (#129)
Re: Big 7.1 open items

Bruce Momjian writes:

If we have a new CREATE DATABASE LOCATION command, we can say:

CREATE DATABASE LOCATION dbloc IN '/var/private/pgsql';
CREATE DATABASE newdb IN dbloc;

We kind of have this already, with CREATE DATABASE foo WITH LOCATION =
'bar'; but of course with environment variable kludgery. But it's a start.

mkdir /var/private/pgsql/dbloc
ln -s /var/private/pgsql/dbloc data/base/dbloc

I think the problem with this was that you'd have to do an extra lookup
into, say, pg_location to resolve this. Some people are talking about
blind writes, this is not really blind.

CREATE LOCATION tabloc IN '/var/private/pgsql';
CREATE TABLE newtab ... IN tabloc;

Okay, so we'd have "table spaces" and "database spaces". Seems like one
"space" ought to be enough. I was thinking that the database "space" would
serve as a default "space" for tables created within it but you could
still create tables in other "spaces" than where the database really is. In
fact, the database wouldn't show up at all in the file names anymore,
which may or may not be a good thing.

I think Tom suggested something more or less like this:

$PGDATA/base/tablespace/segment/table

(leaving the details of "table" aside for now). pg_class would get a
column storing the table space somehow, say an oid reference to
pg_location. There would have to be a default tablespace that's created by
initdb and it's indicated by oid 0. So if you create a simple little table
"foo" it ends up in

$PGDATA/base/0/0/foo

That is pretty manageable. Now to create a table space you do

CREATE LOCATION "name" AT '/some/where';

which would make an entry in pg_location and, similar to how you
suggested, create a symlink from

$PGDATA/base/newoid -> /some/where

Then when you create a new table at that new location this gets simply
noted in pg_class with an oid reference, the rest works completely
transparently and no lookup outside of pg_class required. The system would
create the segment 0 subdirectory automatically.

When tables get segmented the system would simply create subdirectories 1,
2, 3, etc. as needed, just as it created the 0 as needed; no extra code.

pg_dump doesn't need to use lstat or whatever at all because the locations
are catalogued. Administrators don't even need to know about the linking
business, they just make sure the target directory exists.
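The mechanics above boil down to one symlink plus on-demand directories. In the shell sketch below, the oid 17001 and the target path are invented for the example:

```shell
# Sketch of the $PGDATA/base/<tablespace-oid>/<segment>/<table> layout:
# oid 0 is the default tablespace made by initdb, and CREATE LOCATION
# amounts to one symlink. The oid 17001 and all paths are assumed.
set -e
PGDATA=/tmp/loc_demo
rm -rf "$PGDATA" /tmp/somewhere
mkdir -p "$PGDATA"/base/0/0 /tmp/somewhere

echo "foo data" > "$PGDATA"/base/0/0/foo   # plain table, default space

# CREATE LOCATION "name" AT '/some/where' would amount to:
ln -s /tmp/somewhere "$PGDATA"/base/17001
mkdir "$PGDATA"/base/17001/0               # segment 0, created on demand
echo "bar data" > "$PGDATA"/base/17001/0/bar
```

Segments 1, 2, 3, etc. would be created the same way as segment 0, with no extra code.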

Two more items to ponder:

* per-location transaction logs

* pg_upgrade

--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden

#168Peter Eisentraut
peter_e@gmx.net
In reply to: Thomas Lockhart (#67)
Character sets (Re: Re: Big 7.1 open items)

Thomas Lockhart writes:

One issue: I can see (or imagine ;) how we can use the Postgres type
system to manage multiple character sets.

But how are you going to tell a genuine "type" from a character set? And
you might have to have three types for each charset. There'd be a lot of
redundancy and confusion regarding the input and output functions and
other pg_type attributes. No doubt there's something to be learned from
the type system, but character sets have different properties -- like
characters(!), collation rules, encoding "translations" and what not.
There is no doubt also need for different error handling. So I think that
just dumping every character set into pg_type is not a good idea. That's
almost equivalent to having separate types for char(6), char(7), etc.

Instead, I'd suggest that character sets become separate objects. A
character entity would carry around its character set in its header
somehow. Consider a string concatenation function, being invoked with two
arguments of the same exotic character set. Using the type system only
you'd have to either provide a function signature for all combinations of
characters sets or you'd have to cast them up to SQL_TEXT, concatenate
them and cast them back to the original charset. A smarter concatenation
function instead might notice that both arguments are of the same
character set and simply paste them together right there.

But allowing arbitrary character sets in, say, table names forces us
to cope with allowing a mix of character sets in a single column of a
system table.

The priority is probably the data people store, not the way they get to
name their tables.

Would it be acceptable to have a "default database character set"
which is allowed to creep into the pg_xxx tables?

I think we could go with making all system table char columns Unicode, but
of course they are really of the "name" type, which is another issue
completely.

We should itemize all of these issues so we can keep track of what is
necessary, possible, and/or "easy".

Here are a couple of "items" I keep wondering about:

* To what extent would we be able to use the operating system's locale
facilities? Besides the fact that some systems are deficient or broken one
way or another, POSIX really doesn't provide much besides "given two
strings, which one is greater", and then only on a per-process basis.
We'd really need more than that; see also the LIKE indexing issues, and
indexing in general.

* Client support: A lot of language environments provide pretty smooth
Unicode support these days, e.g., Java, Perl 5.6, and I think that C99 has
also made some strides. So while "we can store stuff in any character set
you want" is great, it's really no good if it doesn't work transparently
with the client interfaces. At least something to keep in mind.


#169Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Peter Eisentraut (#167)
Re: Big 7.1 open items

[ Charset ISO-8859-1 unsupported, converting... ]

Bruce Momjian writes:

If we have a new CREATE DATABASE LOCATION command, we can say:

CREATE DATABASE LOCATION dbloc IN '/var/private/pgsql';
CREATE DATABASE newdb IN dbloc;

We kind of have this already, with CREATE DATABASE foo WITH LOCATION =
'bar'; but of course with environment variable kludgery. But it's a start.

Yes, I didn't like the environment variable stuff. In fact, I would
like to not mention the symlink location anywhere in the database, so it
can be changed without changing it in the database.

mkdir /var/private/pgsql/dbloc
ln -s /var/private/pgsql/dbloc data/base/dbloc

I think the problem with this was that you'd have to do an extra lookup
into, say, pg_location to resolve this. Some people are talking about
blind writes, this is not really blind.

I was thinking of storing the relfilename as dbloc/mytab32332.

CREATE LOCATION tabloc IN '/var/private/pgsql';
CREATE TABLE newtab ... IN tabloc;

Okay, so we'd have "table spaces" and "database spaces". Seems like one
"space" ought to be enough. I was thinking that the database "space" would
serve as a default "space" for tables created within it but you could
still create tables in other "spaces" than where the database really is. In
fact, the database wouldn't show up at all in the file names anymore,
which may or may not be a good thing.

I think Tom suggested something more or less like this:

$PGDATA/base/tablespace/segment/table

So you'd mix tables from different databases in the same tablespace? Seems
better to keep them in separate directories for efficiency and clarity.

We could use tablespace/dbname/table so that a tablespace would have
a directory for each database that uses the tablespace.

(leaving the details of "table" aside for now). pg_class would get a
column storing the table space somehow, say an oid reference to
pg_location. There would have to be a default tablespace that's created by
initdb and it's indicated by oid 0. So if you create a simple little table
"foo" it ends up in

$PGDATA/base/0/0/foo

Seems better to use the top directory for 0, and have extents in
subdirectories like Extent-2, etc. Easier for administrators and new
people.

However, one problem is that tables created in a database without a
location are put under the pgsql directory. You would have to symlink the
actual database directory. Maybe that is why I had separate database
locations. I realize that is bad.

That is pretty manageable. Now to create a table space you do

CREATE LOCATION "name" AT '/some/where';

which would make an entry in pg_location and, similar to how you
suggested, create a symlink from

$PGDATA/base/newoid -> /some/where

Then when you create a new table at that new location this gets simply
noted in pg_class with an oid reference, the rest works completely
transparently and no lookup outside of pg_class required. The system would
create the segment 0 subdirectory automatically.

When tables get segmented the system would simply create subdirectories 1,
2, 3, etc. as needed, just as it created the 0 as needed; no extra code.

pg_dump doesn't need to use lstat or whatever at all because the locations
are catalogued. Administrators don't even need to know about the linking
business, they just make sure the target directory exists.

What I was suggesting is not to catalog the symlink locations, but to
use lstat when dumping, so that admins can move files around using
symlinks and not have to update the database.
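The discovery step Bruce has in mind could look roughly like this in shell, with readlink standing in for a direct lstat call; the directory names are made up:

```shell
# Sketch of discovering symlinked tablespaces at dump time instead of
# cataloguing their targets: lstat each entry (readlink here) and record
# where it points. Directory names are hypothetical.
set -e
rm -rf /tmp/dump_demo /tmp/moved
mkdir -p /tmp/dump_demo/base/db /tmp/moved
cd /tmp/dump_demo/base/db
mkdir plain_ts
ln -s /tmp/moved linked_ts       # an admin relocated this tablespace

for ts in *; do
    if [ -L "$ts" ]; then
        echo "$ts -> $(readlink "$ts")"
    else
        echo "$ts (local)"
    fi
done > /tmp/dump_demo/report
```

An admin can then move a tablespace and re-point the symlink without the database catalogs ever needing an update.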

#170Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#162)
RE: Big 7.1 open items

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Agreed. Seems we have several issues:

filename contents
tablespace implementation
tablespace directory layout
tablespace commands and syntax

[snip]

Comments? Anything else that we do have consensus on?

Before the details of tablespace implementation,

1) How to change (extend) the syntax of CREATE TABLE.
Do we only add a table(data)space name with some
keyword? i.e., do we consider tablespace as an
abstraction?

To confirm our mutual understanding:

2) Is a tablespace defined per PostgreSQL database?
3) Is the default tablespace defined per database/user, or
for all?

AFAIK in Oracle, 2) is global and 3) is per user.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#171Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Peter Eisentraut (#167)
RE: Big 7.1 open items

-----Original Message-----
From: Peter Eisentraut

Bruce Momjian writes:

If we have a new CREATE DATABASE LOCATION command, we can say:

CREATE DATABASE LOCATION dbloc IN '/var/private/pgsql';
CREATE DATABASE newdb IN dbloc;

We kind of have this already, with CREATE DATABASE foo WITH LOCATION =
'bar'; but of course with environment variable kludgery. But it's a start.

mkdir /var/private/pgsql/dbloc
ln -s /var/private/pgsql/dbloc data/base/dbloc

I think the problem with this was that you'd have to do an extra lookup
into, say, pg_location to resolve this. Some people are talking about
blind writes, this is not really blind.

CREATE LOCATION tabloc IN '/var/private/pgsql';
CREATE TABLE newtab ... IN tabloc;

Okay, so we'd have "table spaces" and "database spaces". Seems like one
"space" ought to be enough.

Does your "database space" correspond to the current PostgreSQL database?
And is it different from SCHEMA?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#172Philip J. Warner
pjw@rhyme.com.au
In reply to: Hiroshi Inoue (#170)
RE: Big 7.1 open items

At 05:59 21/06/00 +0900, Hiroshi Inoue wrote:

Before the details of tablespace implementation,

1) How to change (extend) the syntax of CREATE TABLE.
Do we only add a table(data)space name with some
keyword? i.e., do we consider tablespace as an
abstraction?

It may be worth considering leaving the CREATE TABLE statement alone.
Dec/RDB uses a new statement entirely to define where a table goes. It's
actually a *very* complex statement, but the key syntax is:

CREATE STORAGE MAP <map-name> FOR <table-name>
[PLACEMENT VIA INDEX <index-name>]
STORE [COLUMNS ([col-name,])]
[IN <area-name>
| RANDOMLY ACROSS <area-list>]
;

where <area-name> is the name of a Dec/RDB STORAGE AREA, which is basically
a file that contains one or more tables/indices etc. There are options to
specify area choice by column value, fullness, how to store BLOBs etc etc.

I realize that this is way too complex for a first pass, but it gives an
idea of where you *might* want to go, and hence, possibly, a reason for
starting out with something like:

CREATE STORAGE MAP <map-name> for <table-name> STORE IN <area-name>;

P.S. I really hope this is more cogent than my last message.


#173Chris Bitmead
chrisb@nimrod.itg.telstra.com.au
In reply to: Bruce Momjian (#159)
Re: Big 7.1 open items

Tom Lane wrote:

Some unhappiness was raised about
depending on symlinks for this function, but I didn't hear one single
concrete reason not to do it, nor an alternative design.

Are symlinks portable?

#174Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Hiroshi Inoue (#171)
Re: Big 7.1 open items

[ Charset ISO-8859-1 unsupported, converting... ]

-----Original Message-----
From: Peter Eisentraut

Bruce Momjian writes:

If we have a new CREATE DATABASE LOCATION command, we can say:

CREATE DATABASE LOCATION dbloc IN '/var/private/pgsql';
CREATE DATABASE newdb IN dbloc;

We kind of have this already, with CREATE DATABASE foo WITH LOCATION =
'bar'; but of course with environment variable kludgery. But it's a start.

mkdir /var/private/pgsql/dbloc
ln -s /var/private/pgsql/dbloc data/base/dbloc

I think the problem with this was that you'd have to do an extra lookup
into, say, pg_location to resolve this. Some people are talking about
blind writes, this is not really blind.

CREATE LOCATION tabloc IN '/var/private/pgsql';
CREATE TABLE newtab ... IN tabloc;

Okay, so we'd have "table spaces" and "database spaces". Seems like one
"space" ought to be enough.

Does your "database space" correspond to the current PostgreSQL database?
And is it different from SCHEMA?

OK, seems I have things a little confused. My whole idea of database
locations vs. normal locations is flawed. Here is my new proposal.

First, I believe there should be locations defined per database, not
global locations.

I recommend

CREATE TABLESPACE tabloc USING '/var/private/pgsql';
CREATE TABLE newtab ... IN tabloc;

and this does:

mkdir /var/private/pgsql/dbname
mkdir /var/private/pgsql/dbname/tabloc
ln -s /var/private/pgsql/dbname/tabloc data/base/tabloc

I recommend making a dbname in each directory, then putting the
location inside there.

This allows the same directory to be used for tablespaces by several
databases, and allows databases created in locations without making
special per-database locations.

I can give a more specific proposal if people wish.

Comments?

#175Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Chris Bitmead (#173)
Re: Big 7.1 open items

Tom Lane wrote:

Some unhappiness was raised about
depending on symlinks for this function, but I didn't hear one single
concrete reason not to do it, nor an alternative design.

Are symlinks portable?

Sure, and if the system loading it cannot create the required symlinks
because the directories don't exist, it can just skip the symlink step.

#176Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#174)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I recommend making a dbname in each directory, then putting the
location inside there.

This still seems backwards to me. Why is it better than tablespace
directory inside database directory?

One significant problem with it is that there's no longer (AFAICS)
a "default" per-database directory that corresponds to the current
working directory of backends running in that database. Thus,
for example, it's not immediately clear where temporary files and
backend core-dump files will end up. Also, you've just added an
essential extra level (if not two) to the pathnames that backends will
use to address files.

There is a great deal to be said for
..../database/tablespace/filename
where .../database/ is the working directory of a backend running in
that database, so that the relative pathname used by that backend to
get to a table is just tablespace/filename. I fail to see any advantage
in reversing the pathname order. If you see one, enlighten me.

regards, tom lane

#177Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#176)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I recommend making a dbname in each directory, then putting the
location inside there.

This still seems backwards to me. Why is it better than tablespace
directory inside database directory?

Yes, that is what I want too.

One significant problem with it is that there's no longer (AFAICS)
a "default" per-database directory that corresponds to the current
working directory of backends running in that database. Thus,
for example, it's not immediately clear where temporary files and
backend core-dump files will end up. Also, you've just added an
essential extra level (if not two) to the pathnames that backends will
use to address files.

There is a great deal to be said for
..../database/tablespace/filename
where .../database/ is the working directory of a backend running in
that database, so that the relative pathname used by that backend to
get to a table is just tablespace/filename. I fail to see any advantage
in reversing the pathname order. If you see one, enlighten me.

Yes, agreed. I was thinking this:

CREATE TABLESPACE loc USING '/var/pgsql'

does:

ln -s /var/pgsql/dbname/loc data/base/dbname/loc

In this way, the database has a view of its main directory, plus a /loc
subdirectory for the tablespace. In the other location, we have
/var/pgsql/dbname/loc because this allows different databases to use:

CREATE TABLESPACE loc USING '/var/pgsql'

and they do not collide with each other in /var/pgsql. It puts /loc
inside the dbname that created it. It also allows:

CREATE DATABASE loc IN '/var/pgsql'

to work because this does:

ln -s /var/pgsql/dbname data/base/dbname

Seems we should create the dbname and loc directories for the users
automatically in the symlink target to keep things clean. It prevents
them from accidentally having two databases point to the same directory.

Comments?
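As a sanity check, the scheme above reduces to a few filesystem operations. All paths and database names below are invented for the example:

```shell
# Sketch of CREATE TABLESPACE loc USING '/var/pgsql' for two databases:
# the target side carries a per-database subdirectory, so the databases
# never collide inside the shared location. All names are hypothetical.
set -e
VAR=/tmp/var_pgsql
DATA=/tmp/data_demo
rm -rf "$VAR" "$DATA"
mkdir -p "$DATA"/base/db1 "$DATA"/base/db2

for db in db1 db2; do
    mkdir -p "$VAR"/$db/loc                    # made automatically for the user
    ln -s "$VAR"/$db/loc "$DATA"/base/$db/loc  # database sees base/<db>/loc
done

echo "t1 of db1" > "$DATA"/base/db1/loc/t1
echo "t1 of db2" > "$DATA"/base/db2/loc/t1     # same table name, no clash
```

Both databases issue the identical command against the identical location, yet their files stay separated by the per-database subdirectory.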

#178Chris Bitmead
chrisb@nimrod.itg.telstra.com.au
In reply to: Bruce Momjian (#175)
Re: Big 7.1 open items

Bruce Momjian wrote:

Tom Lane wrote:

Some unhappiness was raised about
depending on symlinks for this function, but I didn't hear one single
concrete reason not to do it, nor an alternative design.

Are symlinks portable?

Sure, and if the system loading it can not create the required symlinks
because the directories don't exist, it can just skip the symlink step.

What I meant is, would you still be able to create tablespaces on
systems without symlinks? That would seem to be a desirable feature.

#179Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#176)
RE: Big 7.1 open items

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I recommend making a dbname in each directory, then putting the
location inside there.

This still seems backwards to me. Why is it better than tablespace
directory inside database directory?

One significant problem with it is that there's no longer (AFAICS)
a "default" per-database directory that corresponds to the current
working directory of backends running in that database. Thus,
for example, it's not immediately clear where temporary files and
backend core-dump files will end up. Also, you've just added an
essential extra level (if not two) to the pathnames that backends will
use to address files.

There is a great deal to be said for
..../database/tablespace/filename

OK, I seem to have gotten the answer to the question:
Is tablespace defined per PostgreSQL's database?

You and Bruce
1) tablespace is per database
Peter seems to have the following idea(?? not sure)
2) database = tablespace
My opinion
3) database and tablespace are relatively irrelevant.
I assume PostgreSQL's database would correspond
to the concept of SCHEMA.

It seems we are different from the first.
Shouldn't we reach an agreement on it in the first place?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#180Tom Lane
tgl@sss.pgh.pa.us
In reply to: Chris Bitmead (#178)
Re: Big 7.1 open items

Chris Bitmead <chrisb@nimrod.itg.telstra.com.au> writes:

What I meant is, would you still be able to create tablespaces on
systems without symlinks? That would seem to be a desirable feature.

All else being equal, it'd be nice. Since all else is not equal,
exactly how much sweat are we willing to expend on supporting that
feature on such systems --- to the exclusion of other features we
might expend the same sweat on, with more widely useful results?

Bear in mind that everything will still *work* just fine on such a
platform, you just don't have a way to spread the database across
multiple filesystems. That's only an issue if the platform has a
fairly Unixy notion of filesystems ... but no symlinks.

A few messages back someone was opining that we were wasting our time
thinking about tablespaces at all, because any modern platform can
create disk-spanning filesystems for itself, so applications don't have
to worry. I don't buy that argument in general, but I'm quite willing
to quote it for the *very* few systems that are Unixy enough to run
Postgres in the first place, but not quite Unixy enough to have
symlinks.

You gotta draw the line somewhere at what you will support, and
this particular line seems to me to be entirely reasonable and
justifiable. YMMV...

regards, tom lane

#181Don Baccus
dhogaza@pacifier.com
In reply to: Philip J. Warner (#172)
RE: Big 7.1 open items

At 11:22 AM 6/21/00 +1000, Philip J. Warner wrote:

It may be worth considering leaving the CREATE TABLE statement alone.
Dec/RDB uses a new statement entirely to define where a table goes...

It's worth considering, but on the other hand Oracle users greatly
outnumber Compaq/RDB users these days...

If there's no SQL92 guidance for implementing a feature, I'm pretty much in
favor of tracking Oracle, whose SQL dialect is rapidly becoming a
de-facto standard.

I'm not saying I like the fact, Oracle's a pain in the ass. But when
adopting existing syntax, might as well adopt that of the crushing
borg.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.

#182Don Baccus
dhogaza@pacifier.com
In reply to: Chris Bitmead (#173)
Re: Big 7.1 open items

At 12:27 PM 6/21/00 +1000, Chris Bitmead wrote:

Tom Lane wrote:

Some unhappiness was raised about
depending on symlinks for this function, but I didn't hear one single
concrete reason not to do it, nor an alternative design.

Are symlinks portable?

In today's world? Yeah, I think so.

My only unhappiness has hinged around the possibility that a new
storage scheme might tempt folks to toss aside the smgr abstraction,
or weaken it.

It doesn't appear that this will happen.

Given an adequate smgr abstraction, it doesn't really matter what
low-level model is adopted in some sense (i.e. other models might
become available, the implemented model might get replaced, etc -
without breaking backends).

Obviously we'll all be using the default model for some time, maybe
forever, but if mistakes are made maintaining the smgr abstraction
means that replacements are possible. Or kinky substitutes like
working with DAFS.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.

#183Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Bruce Momjian (#169)
Re: Big 7.1 open items

Yes, I didn't like the environment variable stuff. In fact, I would
like to not mention the symlink location anywhere in the database, so
it can be changed without changing it in the database.

Well, as y'all have noticed, I think there are strong reasons to use
environment variables to manage locations, and that symlinks are a
potential portability and robustness problem.

An additional point which has relevance to this whole discussion:

In the future we may allow system resources such as tables to carry names
which use multi-byte encodings. afaik these encodings are not allowed to
be used for physical file names, and even if they were, the utility of
using standard operating system utilities like ls goes way down.

istm that from a portability and evolutionary standpoint OID-only file
names (or at least file names *not* based on relation/class names) is a
requirement.

Comments?

- Thomas

#184Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#179)
Re: Big 7.1 open items

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

There is a great deal to be said for
..../database/tablespace/filename

OK, I seem to have gotten the answer to the question:
Is tablespace defined per PostgreSQL's database?

Not necessarily --- the tablespace subdirectories could be symlinks
pointing to the same place (assuming you use OIDs or something to keep
the table filenames unique even across databases). This is just an
implementation mechanism; it doesn't foreclose the policy decision
whether tablespaces are database-local or installation-wide.

(OTOH, pathnames like tablespace/database would pretty much force
tablespaces to be installation-wide whether you wanted it that way
or not.)

My opinion
3) database and tablespace are relatively irrelevant.
I assume PostgreSQL's database would correspond
to the concept of SCHEMA.

My inclination is that tablespaces should be installation-wide, but
I'm not completely sold on it. In any case I could see wanting a
permissions mechanism that would only allow some databases to have
tables in a particular tablespace.

We do need to think more about how traditional Postgres databases
fit together with SCHEMA. Maybe we wouldn't even need multiple
databases per installation if we had SCHEMA done right.

regards, tom lane

#185Ross J. Reedstrom
reedstrm@rice.edu
In reply to: Tom Lane (#184)
Re: Big 7.1 open items

On Wed, Jun 21, 2000 at 01:23:57AM -0400, Tom Lane wrote:

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

My opinion
3) database and tablespace are relatively irrelevant.
I assume PostgreSQL's database would correspond
to the concept of SCHEMA.

My inclination is that tablespaces should be installation-wide, but
I'm not completely sold on it. In any case I could see wanting a
permissions mechanism that would only allow some databases to have
tables in a particular tablespace.

We do need to think more about how traditional Postgres databases
fit together with SCHEMA. Maybe we wouldn't even need multiple
databases per installation if we had SCHEMA done right.

The important point I think is that tablespaces are about physical
storage/namespace, and SCHEMA are about logical namespace: it would make
sense for tables from multiple schema to live in the same tablespace,
as well as tables from one schema to be stored in multiple tablespaces.

Ross
--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005

#186Chris Bitmead
chrisb@nimrod.itg.telstra.com.au
In reply to: Hiroshi Inoue (#179)
Re: Big 7.1 open items

"Ross J. Reedstrom" wrote:

The important point I think is that tablespaces are about physical
storage/namespace, and SCHEMA are about logical namespace: it would make
sense for tables from multiple schema to live in the same tablespace,
as well as tables from one schema to be stored in multiple tablespaces.

If we accept that argument (which sounds good) then wouldn't we have...

data/base/db1/table1 -> ../../../tablespace/ts1/db1.table1
data/base/db1/table2 -> ../../../tablespace/ts1/db1.table2
data/tablespace/ts1/db1.table1
data/tablespace/ts1/db1.table2

In other words there is a directory for databases, and a directory for
tablespaces. Database tables are symlinked to the appropriate
tablespace. So there is multiple databases per tablespace and multiple
tablespaces per database.
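That layout is easy to mock up on disk (a sketch with illustrative names only; note the symlink targets here are written relative to data/base/db1):

```shell
#!/bin/sh
# Sketch of the layout above: one tree for databases, one for
# tablespaces, with per-table symlinks. Names are illustrative.
set -e
ROOT=$(mktemp -d)
mkdir -p "$ROOT/data/base/db1" "$ROOT/data/tablespace/ts1"

# Physical files live in the tablespace, prefixed by database name so
# multiple databases can share one tablespace without collisions.
touch "$ROOT/data/tablespace/ts1/db1.table1" \
      "$ROOT/data/tablespace/ts1/db1.table2"

# The database directory holds symlinks pointing at the physical files.
ln -s ../../tablespace/ts1/db1.table1 "$ROOT/data/base/db1/table1"
ln -s ../../tablespace/ts1/db1.table2 "$ROOT/data/base/db1/table2"
```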

#187Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#168)
Re: Character sets (Re: Re: Big 7.1 open items)

But how are you going to tell a genuine "type" from a character set? And
you might have to have three types for each charset. There'd be a lot of
redundancy and confusion regarding the input and output functions and
other pg_type attributes. No doubt there's something to be learned from
the type system, but character sets have different properties -- like
characters(!), collation rules, encoding "translations" and what not.
There is no doubt also a need for different error handling. So I think that
just dumping every character set into pg_type is not a good idea. That's
almost equivalent to having separate types for char(6), char(7), etc.

Instead, I'd suggest that character sets become separate objects. A
character entity would carry around its character set in its header
somehow. Consider a string concatenation function, being invoked with two
arguments of the same exotic character set. Using the type system only
you'd have to either provide a function signature for all combinations of
characters sets or you'd have to cast them up to SQL_TEXT, concatenate
them and cast them back to the original charset. A smarter concatenation
function instead might notice that both arguments are of the same
character set and simply paste them together right there.

Interesting idea. But what about collations? SQL allows assigning a
collation different from the default one to a character set on the
fly. Should we make collations separate objects as well?

Here are a couple of "items" I keep wondering about:

* To what extent would we be able to use the operating system's locale
facilities? Besides the fact that some systems are deficient or broken one
way or another, POSIX really doesn't provide much besides "given two
strings, which one is greater", and then only on a per-process basis.
We'd really need more than that; see also LIKE indexing issues, and
indexing in general.

Correct. I'd suggest completely getting rid of the OS's locale.

* Client support: A lot of language environments provide pretty smooth
Unicode support these days, e.g., Java, Perl 5.6, and I think that C99 has
also made some strides. So while "we can store stuff in any character set
you want" is great, it's really no good if it doesn't work transparently
with the client interfaces. At least something to keep in mind.

Do you suggest that we should convert everything into Unicode and store
it in the DB?
--
Tatsuo Ishii

#188Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Thomas Lockhart (#166)
SQL_TEXT (Re: Re: Big 7.1 open items)

Yuck. That is what is says, all right :(

Date says that SQL_TEXT is required to have two things:
1) all characters used in the SQL language itself (which is what I
recalled)

2) Every other character from every character set in the installation.

Doesn't it say "character repertoire", rather than character set? I
think it would be possible to let our SQL_TEXT support every character
repertoire in the world, if we use Unicode or the Mule internal code
for that.
--
Tatsuo Ishii

#189Philip J. Warner
pjw@rhyme.com.au
In reply to: Don Baccus (#181)
RE: Big 7.1 open items

At 22:12 20/06/00 -0700, Don Baccus wrote:

At 11:22 AM 6/21/00 +1000, Philip J. Warner wrote:

It may be worth considering leaving the CREATE TABLE statement alone.
Dec/RDB uses a new statement entirely to define where a table goes...

It's worth considering, but on the other hand Oracle users greatly
outnumber Compaq/RDB users these days...

It's actually Oracle/Rdb, but I call it Dec/Rdb to distinguish it from
'Oracle/Oracle'. It was acquired by Oracle, supposedly because Oracle
wanted their optimizer, management and tuning tools (although that was only
hearsay). They *say* that they plan to merge the two products.

What I was trying to suggest was that the CREATE TABLE statement will get
very overloaded, and it might be worth avoiding having to support two
storage management syntaxes if/when it becomes desirable to create a
'storage' statement of some kind.

I'm not saying I like the fact, Oracle's a pain in the ass. But when
adopting existing syntax, might as well adopt that of the crushing
borg.

Only if it is a good thing, or part of a real standard. Philosophically,
where possible I would prefer to see statement that are *in* the SQL
standard (ie. CREATE TABLE) to be left as unencumbered as possible.

----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/

#190Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Chris Bitmead (#186)
RE: Big 7.1 open items

-----Original Message-----
From: pgsql-hackers-owner@hub.org
[mailto:pgsql-hackers-owner@hub.org]On Behalf Of Chris Bitmead

"Ross J. Reedstrom" wrote:

The important point I think is that tablespaces are about physical
storage/namespace, and SCHEMA are about logical namespace: it would make
sense for tables from multiple schema to live in the same tablespace,
as well as tables from one schema to be stored in multiple tablespaces.

If we accept that argument (which sounds good) then wouldn't we have...

data/base/db1/table1 -> ../../../tablespace/ts1/db1.table1
data/base/db1/table2 -> ../../../tablespace/ts1/db1.table2
data/tablespace/ts1/db1.table1
data/tablespace/ts1/db1.table2

Hmm, is the above symlinking business really preferable just because
it is possible? Why do we have to be dependent upon a directory
tree representation when we handle db structure?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#191Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Hiroshi Inoue (#190)
AW: Big 7.1 open items

The current discussion of symlinks is focusing on using directory
symlinks, not file symlinks, to represent/implement tablespace layout.

If that is the only issue for the symlinks, I think it would be sufficient
to put the files in the correct subdirectories. The dba can then decide
whether he wants to mount filesystems directly at the desired location,
or create a symlink. I do not see an advantage in creating a symlink
in the backend, since the dba has to create the filesystems anyway.

fs: data
fs: data/base/...../extent1
link: data/base/...../extent2 -> /data/extent2
...

Andreas

#192Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Zeugswetter Andreas SB (#191)
AW: Big 7.1 open items

CREATE LOCATION tabloc IN '/var/private/pgsql';
CREATE TABLE newtab ... IN tabloc;

Okay, so we'd have "table spaces" and "database spaces".

Seems like one

"space" ought to be enough.

Yes, one space should be enough.

Does your "database space" correspond to current PostgreSQL's
database ?

I think we should think of the "database space" as the default "table space"
for this database.

And is it different from SCHEMA ?

Please don't mix schema and database; they are two different issues.
Even Oracle has a database; it's just that in Oracle you are limited to
one database per instance. We do not want to add this limitation to
PostgreSQL.

Andreas

#193Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Zeugswetter Andreas SB (#192)
AW: Big 7.1 open items

My opinion
3) database and tablespace are relatively irrelevant.
I assume PostgreSQL's database would correspond
to the concept of SCHEMA.

No, this should definitely not be so.

Andreas

#194Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Zeugswetter Andreas SB (#193)
AW: Big 7.1 open items

My inclination is that tablespaces should be installation-wide, but
I'm not completely sold on it. In any case I could see wanting a
permissions mechanism that would only allow some databases to have
tables in a particular tablespace.

I fully second that.

We do need to think more about how traditional Postgres databases
fit together with SCHEMA. Maybe we wouldn't even need multiple
databases per installation if we had SCHEMA done right.

This gives me goose bumps. A schema is something that is below the
database in the hierarchy. It is the owner of a table. We lack the
ability to qualify a table name with an owner, like "owner".tabname.
Can we please agree to that much?

Andreas

#195Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Chris Bitmead (#178)
Re: Big 7.1 open items

Sure, and if the system loading it can not create the required symlinks
because the directories don't exist, it can just skip the symlink step.

What I meant is, would you still be able to create tablespaces on
systems without symlinks? That would seem to be a desirable feature.

You could create tablespaces, but you could not point them at different
drives. The issue is that we don't store the symlink location in the
database, just the tablespace name.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#196Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Bruce Momjian (#195)
AW: Big 7.1 open items

Sure, and if the system loading it can not create the required symlinks
because the directories don't exist, it can just skip the symlink step.

What I meant is, would you still be able to create tablespaces on
systems without symlinks? That would seem to be a desirable feature.

You could create tablespaces, but you could not point them at different
drives. The issue is that we don't store the symlink location in the
database, just the tablespace name.

You could point them to another drive if your OS allows you to mount a
filesystem under an arbitrary name, no?

Andreas

#197Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#177)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Yes, agreed. I was thinking this:
CREATE TABLESPACE loc USING '/var/pgsql'
does:
ln -s /var/pgsql/dbname/loc data/base/dbname/loc
In this way, the database has a view of its main directory, plus a /loc
subdirectory for the tablespace. In the other location, we have
/var/pgsql/dbname/loc because this allows different databases to use:
CREATE TABLESPACE loc USING '/var/pgsql'
and they do not collide with each other in /var/pgsql.

But they don't collide anyway, because the dbname is already unique.
Isn't the extra subdirectory a waste?

Because table files will have installation-wide unique names, there's
no really good reason to have either level of subdirectory; you could
just make
CREATE TABLESPACE loc USING '/var/pgsql'
do
ln -s /var/pgsql data/base/dbname/loc
and it'd still work even if multiple DBs were using the same tablespace.

However, forcing creation of a subdirectory does give you the chance to
make sure the subdir is owned by postgres and has the right permissions,
so there's something to be said for that. It might be reasonable to do
mkdir /var/pgsql/dbname
chmod 700 /var/pgsql/dbname
ln -s /var/pgsql/dbname data/base/dbname/loc
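Wrapped up, those three steps might look like this (a sketch only; in reality this logic would live in the backend, and the paths and names are hypothetical):

```shell
#!/bin/sh
# Sketch of the subdirectory-per-database variant: create the target
# directory with server-only permissions, then symlink it into the
# data tree.
set -e

create_tablespace() {   # usage: create_tablespace <loc> <dbname> <target> <data>
    loc=$1; dbname=$2; target=$3; data=$4
    mkdir -p "$target/$dbname"
    chmod 700 "$target/$dbname"     # only the server account may look inside
    ln -s "$target/$dbname" "$data/base/$dbname/$loc"
}

TARGET=$(mktemp -d)     # stands in for /var/pgsql
DATA=$(mktemp -d)       # stands in for the data directory
mkdir -p "$DATA/base/mydb"
create_tablespace loc mydb "$TARGET" "$DATA"
```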

regards, tom lane

#198Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Don Baccus (#182)
Re: Big 7.1 open items

At 12:27 PM 6/21/00 +1000, Chris Bitmead wrote:

Tom Lane wrote:

Some unhappiness was raised about
depending on symlinks for this function, but I didn't hear one single
concrete reason not to do it, nor an alternative design.

Are symlinks portable?

In today's world? Yeah, I think so.

My only unhappiness has hinged around the possibility that a new
storage scheme might tempt folks to toss aside the smgr abstraction,
or weaken it.

It doesn't appear that this will happen.

Given an adequate smgr abstraction, it doesn't really matter what
low-level model is adopted in some sense (i.e. other models might
become available, the implemented model might get replaced, etc -
without breaking backends).

Obviously we'll all be using the default model for some time, maybe
forever, but if mistakes are made maintaining the smgr abstraction
means that replacements are possible. Or kinky substitutes like
working with DAFS.

The symlink solution where the actual symlink location is not stored
in the database is certainly abstract. We store that info in the file
system, which is where it belongs. We only query the symlink location
when we need it for database location dumping.
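For instance, a dump utility could recover the location by asking the filesystem directly (a sketch; the paths are hypothetical stand-ins):

```shell
#!/bin/sh
# Sketch: the symlink target is not stored in the catalogs, so a dump
# utility recovers it from the filesystem itself via readlink.
set -e
TARGET=$(mktemp -d)
DATA=$(mktemp -d)
mkdir -p "$DATA/base/mydb"
ln -s "$TARGET" "$DATA/base/mydb/loc"

loc_path="$DATA/base/mydb/loc"
if [ -L "$loc_path" ]; then
    readlink "$loc_path"    # a symlink: report where the tablespace lives
else
    echo "(local)"          # a plain directory: nothing special to dump
fi
```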

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#199Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Thomas Lockhart (#183)
Re: Big 7.1 open items

Yes, I didn't like the environment variable stuff. In fact, I would
like to not mention the symlink location anywhere in the database, so
it can be changed without changing it in the database.

Well, as y'all have noticed, I think there are strong reasons to use
environment variables to manage locations, and that symlinks are a
potential portability and robustness problem.

Sorry, disagree. Environment variables are a pain to administer, and
quite counter-intuitive.

I also don't see any portability or robustness problems. Can you be
more specific?

An additional point which has relevance to this whole discussion:

In the future we may allow system resources such as tables to carry names
which use multi-byte encodings. afaik these encodings are not allowed to
be used for physical file names, and even if they were, the utility of
using standard operating system utilities like ls goes way down.

That is really a different issue: file names. Files for multi-byte table
names can be made to hold just the OID. We have complete control over that
because the file name will be in pg_class.

istm that from a portability and evolutionary standpoint OID-only file
names (or at least file names *not* based on relation/class names) is a
requirement.

Maybe a requirement at some point for some installations, but I hope not
a general requirement.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#200Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Bruce Momjian (#199)
AW: Big 7.1 open items

The symlink solution where the actual symlink location is not stored
in the database is certainly abstract. We store that info in the file
system, which is where it belongs. We only query the symlink location
when we need it for database location dumping.

Sounds good, and also if the symlink query shows a simple directory
we do nothing.

Andreas

#201Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#184)
Re: Big 7.1 open items

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

There is a great deal to be said for
..../database/tablespace/filename

OK, I seem to have gotten the answer to the question:
Is tablespace defined per PostgreSQL's database?

Not necessarily --- the tablespace subdirectories could be symlinks
pointing to the same place (assuming you use OIDs or something to keep
the table filenames unique even across databases). This is just an
implementation mechanism; it doesn't foreclose the policy decision
whether tablespaces are database-local or installation-wide.

Seems we are better off just auto-creating a directory that matches the
dbname.

(OTOH, pathnames like tablespace/database would pretty much force
tablespaces to be installation-wide whether you wanted it that way
or not.)

My opinion
3) database and tablespace are relatively irrelevant.
I assume PostgreSQL's database would correspond
to the concept of SCHEMA.

My inclination is that tablespaces should be installation-wide, but
I'm not completely sold on it. In any case I could see wanting a
permissions mechanism that would only allow some databases to have
tables in a particular tablespace.

One idea is to allow tablespaces defined in template1 to be propagated to
newly created databases, with the symlinks adjusted so they use the
proper dbname in the symlink.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#202Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Ross J. Reedstrom (#185)
Re: Big 7.1 open items

The important point I think is that tablespaces are about physical
storage/namespace, and SCHEMA are about logical namespace: it would make
sense for tables from multiple schema to live in the same tablespace,
as well as tables from one schema to be stored in multiple tablespaces.

It seems mixing the physical layout and the logical SCHEMA would have
problems because people have different reasons for using each feature.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#203Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Philip J. Warner (#189)
Re: Big 7.1 open items

At 22:12 20/06/00 -0700, Don Baccus wrote:

At 11:22 AM 6/21/00 +1000, Philip J. Warner wrote:

It may be worth considering leaving the CREATE TABLE statement alone.
Dec/RDB uses a new statement entirely to define where a table goes...

It's worth considering, but on the other hand Oracle users greatly
outnumber Compaq/RDB users these days...

It's actually Oracle/Rdb, but I call it Dec/Rdb to distinguish it from
'Oracle/Oracle'. It was acquired by Oracle, supposedly because Oracle
wanted their optimizer, management and tuning tools (although that was only
hearsay). They *say* that they plan to merge the two products.

What I was trying to suggest was that the CREATE TABLE statement will get
very overloaded, and it might be worth avoiding having to support two
storage management syntaxes if/when it becomes desirable to create a
'storage' statement of some kind.

Seems adding tablespace to CREATE TABLE/INDEX/DATABASE is pretty simple.
Doing it as a separate command seems cumbersome.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#204Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Bruce Momjian (#199)
Re: Big 7.1 open items

Sorry, disagree. Environment variables are a pain to administer, and
quite counter-intuitive.

Well, I guess we disagree. But until we have a complete proposed
solution, we should leave environment variables on the table, since they
*do* allow some decoupling of logical and physical storage, and *do*
give the administrator some control over resources *that the admin would
not otherwise have*.

istm that from a portability and evolutionary standpoint OID-only
file names (or at least file names *not* based on relation/class
names) is a requirement.

Maybe a requirement at some point for some installations, but I hope
not a general requirement.

If a table name can have characters which are not legal for file names,
then how would you propose to support it? If we are doing a
restructuring of the storage scheme, this should be taken into account.

lockhart=# create table "one/two" (i int);
ERROR: cannot create one/two

Why not? It demonstrates an unfortunate linkage between file systems and
database resources.

- Thomas

#205Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Lockhart (#183)
Re: Big 7.1 open items

Thomas Lockhart <lockhart@alumni.caltech.edu> writes:

Well, as y'all have noticed, I think there are strong reasons to use
environment variables to manage locations, and that symlinks are a
potential portability and robustness problem.

Reasons? Evidence?

An additional point which has relevance to this whole discussion:
In the future we may allow system resources such as tables to carry names
which use multi-byte encodings. afaik these encodings are not allowed to
be used for physical file names, and even if they were, the utility of
using standard operating system utilities like ls goes way down.

Good point, although in one sense a string is a string --- as long as
we don't allow embedded nulls in server-side encodings, we could use
anything that Postgres thought was a name in a filename, and the OS
should take it. But if your local ls doesn't show it the way you see
in Postgres, the usefulness of having the tablename in the filename
goes way down.

istm that from a portability and evolutionary standpoint OID-only file
names (or at least file names *not* based on relation/class names) is a
requirement.

No argument from me ;-). I've been looking for compromise positions
but I still think that pure numeric filenames are the cleanest solution.

There's something else that should be taken into account: for WAL, the
log will need to record the table file that each insert/delete/update
operation affects. To do that with the smgr-token-is-a-pathname
approach I was suggesting yesterday, I think you have to record the
database name and pathname in each WAL log entry. That's 64 bytes/log
entry which is a *lot*. If we bit the bullet and restricted ourselves
to numeric filenames then the log would need just four numeric values:
database OID
tablespace OID
relation OID
relation version number
(this set of 4 values would also be an smgr file reference token).
16 bytes/log entry looks much better than 64.

At the moment I can recall the following opinions:

Pure OID filenames: Thomas, Tom, Marc, Peter E.

OID+relname filenames: Bruce

Vadim was in the pure-OID camp a few months ago, but I won't presume
to list him there now since he hasn't been involved in this most
recent round of discussions. I'm not sure where anyone else stands...
but at least in terms of the core group it's pretty clear where the
majority opinion is.

regards, tom lane

#206Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#197)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Yes, agreed. I was thinking this:
CREATE TABLESPACE loc USING '/var/pgsql'
does:
ln -s /var/pgsql/dbname/loc data/base/dbname/loc
In this way, the database has a view of its main directory, plus a /loc
subdirectory for the tablespace. In the other location, we have
/var/pgsql/dbname/loc because this allows different databases to use:
CREATE TABLESPACE loc USING '/var/pgsql'
and they do not collide with each other in /var/pgsql.

But they don't collide anyway, because the dbname is already unique.
Isn't the extra subdirectory a waste?

Not really. Yes, we could put them all in the same directory, but why
bother? Probably easier to put them in unique directories per database.
It cuts down on directory searches when opening a file, and allows 'du'
to return meaningful numbers per database. If you don't do that, you
can't really tell which files belong to which databases.

Because table files will have installation-wide unique names, there's
no really good reason to have either level of subdirectory; you could
just make
CREATE TABLESPACE loc USING '/var/pgsql'
do
ln -s /var/pgsql data/base/dbname/loc
and it'd still work even if multiple DBs were using the same tablespace.

However, forcing creation of a subdirectory does give you the chance to
make sure the subdir is owned by postgres and has the right permissions,
so there's something to be said for that. It might be reasonable to do
mkdir /var/pgsql/dbname
chmod 700 /var/pgsql/dbname
ln -s /var/pgsql/dbname data/base/dbname/loc

Yes, that is true. My idea is that they may want to create loc1 and
loc2 which initially point to the same location, but later may be moved.
For example, one tablespace for tables, another for indexes. They may
initially point to the same directory, but later be split. Seems we
need to keep the actual tablespace information relevant by using
different directories on the other end too.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#207Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Zeugswetter Andreas SB (#200)
Re: AW: Big 7.1 open items

The symlink solution where the actual symlink location is not stored
in the database is certainly abstract. We store that info in the file
system, which is where it belongs. We only query the symlink location
when we need it for database location dumping.

Sounds good, and also if the symlink query shows a simple directory
we do nothing.

Yes, that is correct.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#208Lamar Owen
lamar.owen@wgcr.org
In reply to: Bruce Momjian (#169)
Re: Big 7.1 open items

Tom Lane wrote:

Thomas Lockhart <lockhart@alumni.caltech.edu> writes:

Well, as y'all have noticed, I think there are strong reasons to use
environment variables to manage locations, and that symlinks are a
potential portability and robustness problem.

Reasons? Evidence?

Does Win32 do symlinks these days? I know Win32 does envvars, and Win32
is currently a supported platform.

I'm not thrilled with either solution -- envvars have their problems
just as surely as symlinks do.

At the moment I can recall the following opinions:

Pure OID filenames: Thomas, Tom, Marc, Peter E.

FWIW, count me here. I have tried administering my system using the
filenames -- and have been bitten. Better admin tools in the PostgreSQL
package beat using standard filesystem tools -- the PostgreSQL tools can
be WAL-aware, transaction-aware, and can provide consistent results.
Filesystem tools never will be able to provide consistent results for a
database system that must remain up 24x7, as many if not most PostgreSQL
installations must.

OID+relname filenames: Bruce

Sorry Bruce -- I understand and am sympathetic to your position, and, at
one time, I agreed with it. But not any more.

--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

#209Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB (#191)
Re: AW: Big 7.1 open items

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

The current discussion of symlinks is focusing on using directory
symlinks, not file symlinks, to represent/implement tablespace layout.

If that is the only issue for the symlinks, I think it would be sufficient
to put the files in the correct subdirectories. The dba can then decide
whether he wants to mount filesystems directly at the desired location,
or create a symlink. I do not see an advantage in creating a symlink
in the backend, since the dba has to create the filesystems anyway.

fs: data
fs: data/base/...../extent1
link: data/base/...../extent2 -> /data/extent2

That (mounting a filesystem directly where the symlink would otherwise
be) would be OK if you were making a new filesystem that you intended to
use *only* as database storage, and *only* for one database ... maybe
even just one extent subdir of one database. I'd accept it as being an
OK answer for anyone unfortunate enough not to have symlinks, but for
most people symlinks would be more flexible.

regards, tom lane

#210Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Lamar Owen (#208)
Re: Big 7.1 open items

FWIW, count me here. I have tried administering my system using the
filenames -- and have been bitten. Better admin tools in the PostgreSQL
package beat using standard filesystem tools -- the PostgreSQL tools can
be WAL-aware, transaction-aware, and can provide consistent results.
Filesystem tools never will be able to provide consistent results for a
database system that must remain up 24x7, as many if not most PostgreSQL
installations must.

OID+relname filenames: Bruce

Sorry Bruce -- I understand and am sympathetic to your position, and, at
one time, I agreed with it. But not any more.

I thought the most recent proposal was to just throw ~16 chars of the
table name on the end of the file name, and that should not be used for
anything except visibility. WAL would not need to store that. It could
just grab the file name that matches the oid/sequence number.

If people don't want table names in the file name, I totally understand,
and we can move on without them. I have made the best case I can for
their inclusion, but if people are not convinced, then maybe I was
wrong.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#211Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#206)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Yes, that is true. My idea is that they may want to create loc1 and
loc2 which initially point to the same location, but later may be moved.
For example, one tablespace for tables, another for indexes. They may
initially point to the same directory, but later be split.

Well, that opens up a completely different issue, which is what about
moving tables from one tablespace to another?

I think the way you appear to be implying above (shut down the server
so that you can rearrange subdirectories by hand) is the wrong way to
go about it. For one thing, lots of people don't want to shut down
their servers completely for that long, but it's difficult to avoid
doing so if you want to move files by filesystem commands. For another
thing, the above approach requires guessing in advance --- maybe long
in advance --- how you are going to want to repartition your database
when it gets too big for your existing storage.

The right way to address this problem is to invent a "move table to
new tablespace" command. This'd be pretty trivial to implement based
on a file-versioning approach: the new version of the pg_class tuple
has a new tablespace identifier in it.

regards, tom lane

#212Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#211)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Yes, that is true. My idea is that they may want to create loc1 and
loc2 which initially point to the same location, but later may be moved.
For example, one tablespace for tables, another for indexes. They may
initially point to the same directory, but later be split.

Well, that opens up a completely different issue, which is what about
moving tables from one tablespace to another?

Are you suggesting that doing dbname/locname is somehow harder to do
that? If you are, I don't understand why.

The general issue of moving tables between tablespaces can be done from
within the database. I don't think it is reasonable to shut down the db
to do that. However, I can see that moving tablespaces to different
symlinked locations may require a shutdown.

I think the way you appear to be implying above (shut down the server
so that you can rearrange subdirectories by hand) is the wrong way to
go about it. For one thing, lots of people don't want to shut down
their servers completely for that long, but it's difficult to avoid
doing so if you want to move files by filesystem commands. For another
thing, the above approach requires guessing in advance --- maybe long
in advance --- how you are going to want to repartition your database
when it gets too big for your existing storage.

The right way to address this problem is to invent a "move table to
new tablespace" command. This'd be pretty trivial to implement based
on a file-versioning approach: the new version of the pg_class tuple
has a new tablespace identifier in it.

Agreed.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#213Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#210)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Sorry Bruce -- I understand and am sympathetic to your position, and, at
one time, I agreed with it. But not any more.

I thought the most recent proposal was to just throw ~16 chars of the
table name on the end of the file name, and that should not be used for
anything except visibility. WAL would not need to store that. It could
just grab the file name that matches the oid/sequence number.

But that's extra complexity in WAL, plus extra complexity in renaming
tables (if you want the filename to track the logical table name, which
I expect you would), plus extra complexity in smgr and bufmgr and other
places.

I think people are coming around to the notion that it's better to keep
these low-level operations simple, even if we need to expend more work
on high-level admin tools as a result.

But we do need to remember to expend that effort on tools! Let's not
drop the ball on that, folks.

regards, tom lane

#214Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#212)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Well, that opens up a completely different issue, which is what about
moving tables from one tablespace to another?

Are you suggesting that doing dbname/locname is somehow harder to do
that? If you are, I don't understand why.

It doesn't make it harder, but it still seems pointless to have the
extra directory level. Bear in mind that if we go with all-OID
filenames then you're not going to be looking at "loc1" and "loc2"
anyway, but at "5938171" and "8583727". It's not much of a convenience
to the admin to see that, so we might as well save a level of directory
lookup.

The general issue of moving tables between tablespaces can be done from
within the database. I don't think it is reasonable to shut down the db
to do that. However, I can see that moving tablespaces to different
symlinked locations may require a shutdown.

Only if you insist on doing it outside the database using filesystem
tools. Another way is to create a new tablespace in the desired new
location, then move the tables one-by-one to that new tablespace.

I suppose either one might be preferable depending on your access
patterns --- locking your most critical tables while they're being moved
might be as bad as a total shutdown.

regards, tom lane

#215Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#214)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Well, that opens up a completely different issue, which is what about
moving tables from one tablespace to another?

Are you suggesting that doing dbname/locname is somehow harder to do
that? If you are, I don't understand why.

It doesn't make it harder, but it still seems pointless to have the
extra directory level. Bear in mind that if we go with all-OID
filenames then you're not going to be looking at "loc1" and "loc2"
anyway, but at "5938171" and "8583727". It's not much of a convenience
to the admin to see that, so we might as well save a level of directory
lookup.

Just seems easier to have stuff segregated into separate per-db
directories for clarity. Also, as directories get bigger, finding a
specific file in them becomes harder. Putting 10 databases all in the
same directory seems bad in this regard.

The general issue of moving tables between tablespaces can be done from
within the database. I don't think it is reasonable to shut down the db
to do that. However, I can see that moving tablespaces to different
symlinked locations may require a shutdown.

Only if you insist on doing it outside the database using filesystem
tools. Another way is to create a new tablespace in the desired new
location, then move the tables one-by-one to that new tablespace.

I suppose either one might be preferable depending on your access
patterns --- locking your most critical tables while they're being moved
might be as bad as a total shutdown.

Seems we are better off having the directory be a symlink so we don't
have symlink overhead for every file open. Also, removing a symlink just
removes the symlink and not the file. I don't think we want to be using
symlinks for tables if we can avoid it.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#216Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#215)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Are you suggesting that doing dbname/locname is somehow harder to do
that? If you are, I don't understand why.

It doesn't make it harder, but it still seems pointless to have the
extra directory level. Bear in mind that if we go with all-OID
filenames then you're not going to be looking at "loc1" and "loc2"
anyway, but at "5938171" and "8583727". It's not much of a convenience
to the admin to see that, so we might as well save a level of directory
lookup.

Just seems easier to have stuff segregated into separate per-db
directories for clarity. Also, as directories get bigger, finding a
specific file in them becomes harder. Putting 10 databases all in the
same directory seems bad in this regard.

Huh? I wasn't arguing against making a db-specific directory below the
tablespace point. I was arguing against making *another* directory
below that one.

I don't think we want to be using
symlinks for tables if we can avoid it.

Agreed, but where did that come from? None of these proposals mentioned
symlinks for anything but directories, AFAIR.

regards, tom lane

#217Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#216)
Re: Big 7.1 open items

Just seems easier to have stuff segregated into separate per-db
directories for clarity. Also, as directories get bigger, finding a
specific file in them becomes harder. Putting 10 databases all in the
same directory seems bad in this regard.

Huh? I wasn't arguing against making a db-specific directory below the
tablespace point. I was arguing against making *another* directory
below that one.

I was suggesting:

ln -s /var/pgsql/dbname/loc data/base/dbname/loc

I thought you were suggesting:

ln -s /var/pgsql/dbname data/base/dbname/loc

With this system:

ln -s /var/pgsql/dbname data/base/dbname/loc1
ln -s /var/pgsql/dbname data/base/dbname/loc2

go into the same directory, which makes it impossible to move loc1
easily using the file system. Seems cheap to add the extra directory.

I don't think we want to be using
symlinks for tables if we can avoid it.

Agreed, but where did that come from? None of these proposals mentioned
symlinks for anything but directories, AFAIR.

I thought you mentioned it. Sorry.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#218Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#162)
Re: Big 7.1 open items

Tom Lane writes:

I think Peter was holding out for storing purely numeric tablespace OID
and table version in pg_class and having a hardwired mapping to pathname
somewhere in smgr. However, I think that doing it that way gains only
micro-efficiency compared to passing a "name" around, while using the
name approach buys us flexibility that's needed for at least some of
the variants under discussion.

But that name can only be a dozen or so characters, contain no slash or
other funny characters, etc. That's really poor. Then the alternative is
to have an internal name and an external canonical name. Then you have two
names to worry about. Also consider that when you store both the table
space oid and the internal name in pg_class you create redundant data.
What if you rename the table space? Do you leave the internal name out of
sync? Then what good is the internal name? I'm just concerned that we are
creating at the table space level problems similar to that we're trying to
get rid of at the relation and database level.

--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden

#219Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Peter Eisentraut (#218)
Re: Big 7.1 open items


Tom Lane writes:

I think Peter was holding out for storing purely numeric tablespace OID
and table version in pg_class and having a hardwired mapping to pathname
somewhere in smgr. However, I think that doing it that way gains only
micro-efficiency compared to passing a "name" around, while using the
name approach buys us flexibility that's needed for at least some of
the variants under discussion.

But that name can only be a dozen or so characters, contain no slash or
other funny characters, etc. That's really poor. Then the alternative is
to have an internal name and an external canonical name. Then you have two
names to worry about. Also consider that when you store both the table
space oid and the internal name in pg_class you create redundant data.
What if you rename the table space? Do you leave the internal name out of
sync? Then what good is the internal name? I'm just concerned that we are
creating at the table space level problems similar to that we're trying to
get rid of at the relation and database level.

Agreed. Having table spaces stored by directories named by oid just
seems very complicated for no reason.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#220Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#219)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

But that name can only be a dozen or so characters, contain no slash or
other funny characters, etc. That's really poor. Then the alternative is
to have an internal name and an external canonical name. Then you have two
names to worry about. Also consider that when you store both the table
space oid and the internal name in pg_class you create redundant data.
What if you rename the table space? Do you leave the internal name out of
sync? Then what good is the internal name? I'm just concerned that we are
creating at the table space level problems similar to that we're trying to
get rid of at the relation and database level.

Agreed. Having table spaces stored by directories named by oid just
seems very complicated for no reason.

Huh? He just gave you two very good reasons: avoid Unix-derived
limitations on the naming of tablespaces (and tables), and avoid
problems with renaming tablespaces.

I'm pretty much firmly back in the "OID and nothing but" camp.
Or perhaps I should say "OID, file version, and nothing but",
since we still need a version number to do CLUSTER etc.

regards, tom lane

#221Randall Parker
randall@nls.net
In reply to: Tom Lane (#220)
Re: Big 7.1 open items

Lamar,

See:

http://support.microsoft.com/support/kb/articles/Q205/5/24.ASP

IMO, it's a bad idea to require the use of symlinks in order to be able to put different tablespaces on different drives. For a discussion of how DB2 supports tablespaces, see my message entitled:
"tablespace managed by system vs managed by database"

I think one of the reasons one needs a fairly complex syntax for creating table spaces is that different devices have different hardware characteristics and one might want to tell the RDBMS to treat them differently for that
reason. You can see how DB2 allows you to do that if you read that message I posted about it.

On Wed, 21 Jun 2000 11:48:19 -0400, Lamar Owen wrote:


Does Win32 do symlinks these days? I know Win32 does envvars, and Win32
is currently a supported platform.

#222Randall Parker
randall@nls.net
In reply to: Randall Parker (#221)
Re: Big 7.1 open items

Tom,

DB2 supports an ALTER TABLESPACE command that allows one to add new containers to an existing tablespace. IMO, that's far more supportive of 24x7 usage.

On Wed, 21 Jun 2000 12:10:15 -0400, Tom Lane wrote:


The right way to address this problem is to invent a "move table to
new tablespace" command. This'd be pretty trivial to implement based
on a file-versioning approach: the new version of the pg_class tuple
has a new tablespace identifier in it.

#223Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Randall Parker (#222)
RE: Big 7.1 open items

If we bit the bullet and restricted ourselves to numeric filenames then
the log would need just four numeric values:
database OID
tablespace OID

Is someone going to implement it for 7.1?

relation OID
relation version number

I believe that we can avoid versions using WAL...

(this set of 4 values would also be an smgr file reference token).
16 bytes/log entry looks much better than 64.

At the moment I can recall the following opinions:

Pure OID filenames: Thomas, Tom, Marc, Peter E.

+ me.

But what about LOCATIONs? I object to using the environment and think
that locations must be stored in pg_control?

Vadim

#224Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#205)
RE: Big 7.1 open items

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

No argument from me ;-). I've been looking for compromise positions
but I still think that pure numeric filenames are the cleanest solution.

There's something else that should be taken into account: for WAL, the
log will need to record the table file that each insert/delete/update
operation affects. To do that with the smgr-token-is-a-pathname
approach I was suggesting yesterday, I think you have to record the
database name and pathname in each WAL log entry. That's 64 bytes/log
entry which is a *lot*. If we bit the bullet and restricted ourselves
to numeric filenames then the log would need just four numeric values:
database OID
tablespace OID

I strongly object to keeping the tablespace OID in the smgr file
reference token, though we have to keep it for another purpose, of
course. I've mentioned many times that tablespace (where to store) info
should be distinguished from *where it is stored* info. Generally a
tablespace isn't sufficiently restrictive for this purpose; e.g. there
was an idea about round-robin, and e.g. Oracle's tablespace could have
plural files, etc.
IMHO, it is misleading to use the tablespace OID as (a part of) the
reference token.

relation OID
relation version number
(this set of 4 values would also be an smgr file reference token).
16 bytes/log entry looks much better than 64.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#225Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Mikheev, Vadim (#223)
RE: Big 7.1 open items

-----Original Message-----
From: Mikheev, Vadim [mailto:vmikheev@SECTORBASE.COM]

If we bit the bullet and restricted ourselves to numeric filenames then
the log would need just four numeric values:
database OID
tablespace OID

Is someone going to implement it for 7.1?

relation OID
relation version number

I believe that we can avoid versions using WAL...

How to re-construct tables in place?
Is the following right?
1) save the content of current table to somewhere
2) shrink the table and related indexes
3) reload the saved(+some filtering) content

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#226Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Hiroshi Inoue (#225)
RE: Big 7.1 open items

relation version number

I believe that we can avoid versions using WAL...

How to re-construct tables in place?
Is the following right?
1) save the content of current table to somewhere
2) shrink the table and related indexes
3) reload the saved(+some filtering) content

Or - create tmp file and load with new content; log "intent to relink table
file";
relink table file; log "file is relinked".

Vadim

#227Chris Bitmead
chrisb@nimrod.itg.telstra.com.au
In reply to: Bruce Momjian (#198)
Re: Big 7.1 open items

Bruce Momjian wrote:

The symlink solution where the actual symlink location is not stored
in the database is certainly abstract. We store that info in the file
system, which is where it belongs. We only query the symlink location
when we need it for database location dumping.

how would that work? would pg_dump dump the tablespace locations or not?

#228Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Mikheev, Vadim (#226)
RE: Big 7.1 open items

-----Original Message-----
From: Mikheev, Vadim [mailto:vmikheev@SECTORBASE.COM]

relation version number

I believe that we can avoid versions using WAL...

How to re-construct tables in place?
Is the following right?
1) save the content of current table to somewhere
2) shrink the table and related indexes
3) reload the saved(+some filtering) content

Or - create tmp file and load with new content; log "intent to
relink table
file";
relink table file; log "file is relinked".

It seems to me that the whole content of the table should be
logged before relinking or shrinking.
Is my understanding right?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#229Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Hiroshi Inoue (#228)
RE: Big 7.1 open items

Or - create tmp file and load with new content;
log "intent to relink table file";
relink table file; log "file is relinked".

It seems to me that whole content of the table should be
logged before relinking or shrinking.

Why not just fsync tmp files?

Vadim

#230Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#205)
RE: Big 7.1 open items

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

At the moment I can recall the following opinions:

Pure OID filenames: Thomas, Tom, Marc, Peter E.

OID+relname filenames: Bruce

Please add my opinion to the list.

Unique-id filename: Hiroshi
(Unique-id is irrelevant to OID/relname).

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#231Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Mikheev, Vadim (#229)
RE: Big 7.1 open items

-----Original Message-----
From: Mikheev, Vadim [mailto:vmikheev@SECTORBASE.COM]

Or - create tmp file and load with new content;
log "intent to relink table file";
relink table file; log "file is relinked".

It seems to me that whole content of the table should be
logged before relinking or shrinking.

Why not just fsync tmp files?

Probably I've misunderstood *relink*.
Is *relink* different from *rename*?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#232Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Hiroshi Inoue (#231)
RE: Big 7.1 open items

Or - create tmp file and load with new content;
log "intent to relink table file";
relink table file; log "file is relinked".

It seems to me that whole content of the table should be
logged before relinking or shrinking.

Why not just fsync tmp files?

Probably I've misunderstood *relink*.
Is *relink* different from *rename*?

I meant something like this - link(table file, tmp2 file); fsync(tmp2
file); unlink(table file); link(tmp file, table file); fsync(table
file); unlink(tmp file). We can do additional logging (with log flush)
of these steps if required, postponing on-recovery redo of operations
till the last relink log record / end of log / transaction abort, etc.

Vadim

#233Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Chris Bitmead (#227)
Re: Big 7.1 open items

Bruce Momjian wrote:

The symlink solution where the actual symlink location is not stored
in the database is certainly abstract. We store that info in the file
system, which is where it belongs. We only query the symlink location
when we need it for database location dumping.

how would that work? would pg_dump dump the tablespace locations or not?

pg_dump would recreate a CREATE TABLESPACE command:

printf("CREATE TABLESPACE %s USING %s", loc, symloc);

where symloc would come from SELECT symloc(loc), returning the value
into a variable used by pg_dump. The backend would do the lstat() and
return the value to the client.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#234Tom Lane
tgl@sss.pgh.pa.us
In reply to: Randall Parker (#222)
Re: Big 7.1 open items

"Randall Parker" <randall@nls.net> writes:

DB2 supports an ALTER TABLESPACE command that allows one to add new
containers to an existing tablespace. IMO, that's far more supportive
of 24x7 usage.

Er, what do they mean by "container", and why is it better?

regards, tom lane

#235Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Mikheev, Vadim (#232)
RE: Big 7.1 open items

-----Original Message-----
From: Mikheev, Vadim [mailto:vmikheev@SECTORBASE.COM]

Or - create tmp file and load with new content;
log "intent to relink table file";
relink table file; log "file is relinked".

It seems to me that whole content of the table should be
logged before relinking or shrinking.

Why not just fsync tmp files?

Probably I've misunderstood *relink*.
If *relink* different from *rename* ?

I meant something like this - link(table file, tmp2 file);
fsync(tmp2 file);
unlink(table file); link(tmp file, table file); fsync(table file);
unlink(tmp file).

I see, the old file would be rolled back from the tmp2 file on abort.
This would work on most platforms.
But the cygwin port has a flaw in that files cannot be unlinked
if they are open. So *relink* may fail in some cases (including
rollback cases).

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#236Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikheev, Vadim (#223)
Re: Big 7.1 open items

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

relation OID
relation version number

I believe that we can avoid versions using WAL...

I don't think so. You're basically saying that
1. create file 'new'
2. delete file 'old'
3. rename 'new' to 'old'
is safe as long as you have a redo log to ensure that the rename
happens even if you crash between steps 2 and 3. But crash is not
the only hazard. What if step 3 just plain fails? Redo won't help.

I'm having a hard time inventing really plausible examples, but a
slightly implausible example is that someone chmod's the containing
directory -w between steps 2 and 3. (Maybe it's not so implausible
if you assume a crash after step 2 ... someone might have left the
directory nonwritable while restoring the system.)

If we use file version numbers, then the *only* thing needed to
make a valid transition between one set of files and another is
a commit of the update of pg_class that shows the new version number
in the rel's pg_class tuple. The worst that can happen to you in
a crash or other failure is that you are unable to get rid of the
set of files that you don't want anymore. That might waste disk
space but it doesn't leave the database corrupted.

But what about LOCATIONs? I object using environment and think that
locations must be stored in pg_control..?

I don't like environment variables for this either; it's just way too
easy to start the postmaster with wrong environment. It still seems
to me that relying on subdirectory symlinks is a good way to go.
pg_control is not so good --- if it gets corrupted, how do you recover?
symlinks can be recreated by hand if necessary, but...

regards, tom lane

#237Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#230)
Re: Big 7.1 open items

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Please add my opinion to the list.
Unique-id filename: Hiroshi
(Unique-id is unrelated to OID/relname).

"Unique ID" is more or less equivalent to "OID + version number",
right?

I was trying earlier to convince myself that a single unique-ID value
would be better than OID+version for the smgr interface, because it'd
certainly be easier to pass around. I failed to convince myself though,
and the thing that bothered me was this. Suppose you are trying to
recover a corrupted database manually, and the only information you have
about which table is which is a somewhat out-of-date listing of OIDs
versus table names. (Maybe it's out of date because you got it from
your last backup tape.) If the files are named OID+version you're not
going to have much trouble seeing which is which, even if some of the
versions are higher than what was on the tape. But if version-updated
tables are given entirely new unique IDs, you've got no hope at all of
telling which one corresponds to what you had in the listing. Maybe
you can tell by looking through the physical file contents, but
certainly this way is more fragile from the point of view of data
recovery.

regards, tom lane

#238Chris Bitmead
chrisb@nimrod.itg.telstra.com.au
In reply to: Bruce Momjian (#233)
Re: Big 7.1 open items

Bruce Momjian wrote:

Bruce Momjian wrote:

The symlink solution where the actual symlink location is not stored
in the database is certainly abstract. We store that info in the file
system, which is where it belongs. We only query the symlink location
when we need it for database location dumping.

how would that work? would pg_dump dump the tablespace locations or not?

pg_dump would recreate a CREATE TABLESPACE command:

printf("CREATE TABLESPACE %s USING %s", loc, symloc);

where symloc would be SELECT symloc(loc) and return the value into a
variable that is used by pg_dump. The backend would do the lstat() and
return the value to the client.
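A backend-side symloc() along these lines might look like the following hypothetical sketch (the function name comes from the example above; the error handling and fallback behavior are illustrative, not actual PostgreSQL code):

```c
#include <assert.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Resolve a tablespace location: if "loc" is a symlink, copy its target
 * into "buf" and return buf; if it is an ordinary directory (the fallback
 * when no symlink was created), return "loc" itself; NULL on error.
 */
static const char *symloc(const char *loc, char *buf, size_t buflen)
{
    struct stat st;
    ssize_t len;

    if (lstat(loc, &st) < 0)
        return NULL;
    if (!S_ISLNK(st.st_mode))
        return loc;                 /* plain directory: no indirection */
    len = readlink(loc, buf, buflen - 1);
    if (len < 0)
        return NULL;
    buf[len] = '\0';                /* readlink() does not NUL-terminate */
    return buf;
}
```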

I'm wondering if pg_dump should store the location of the tablespace. If
your machine dies and you get a new machine to re-create the database, you
may not want the tablespace in the same spot. And text-editing a
gigabyte file would be extremely painful.

#239Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Chris Bitmead (#238)
Re: Big 7.1 open items

where symloc would be SELECT symloc(loc) and return the value into a
variable that is used by pg_dump. The backend would do the lstat() and
return the value to the client.

I'm wondering if pg_dump should store the location of the tablespace. If
your machine dies and you get a new machine to re-create the database, you
may not want the tablespace in the same spot. And text-editing a
gigabyte file would be extremely painful.

If the symlink create fails in CREATE TABLESPACE, it just creates an
ordinary directory.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#240Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#224)
Re: Big 7.1 open items

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

I strongly object to keeping the tablespace OID as the smgr file reference
token, though we have to keep it for another purpose, of course. I've
mentioned many times that tablespace (where to store) info should be
distinguished from *where it is stored* info.

Sure. But this proposal assumes that we're relying on symlinks to
carry the information about physical locations corresponding to
tablespace OIDs. The backend just needs to know enough to access a
relation file at a relative pathname like
tablespaceOID/relationOID
(ignoring version and segment numbers for now). Under the hood,
a symlink for tablespaceOID gets the work done.
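The path composition described here is deliberately trivial; as a minimal sketch (illustrative names, not the real smgr interface), it is nothing more than:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef unsigned int Oid;   /* stand-in for PostgreSQL's Oid type */

/*
 * Compose the relation file path relative to the data directory, per the
 * scheme above; the OS follows the tablespaceOID symlink transparently.
 * (Version and segment numbers are ignored, as in the text.)
 */
static void relpath(char *buf, size_t buflen, Oid tablespace, Oid relation)
{
    snprintf(buf, buflen, "%u/%u", tablespace, relation);
}
```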

Certainly this is not a perfect mechanism. But it is simple, it
is reliable, it is portable to most of the platforms we care about
(yeah, I know we have a Win port, but you wouldn't ever recommend
someone to run a *serious* database on it would you?), and in general
I think the bang-for-the-buck ratio is enormous. I do not want to
have to deal with explicit tablespace bookkeeping in the backend,
but that seems like what we'd have to do in order to improve on
symlinks.

regards, tom lane

#241Randall Parker
randall@nls.net
In reply to: Tom Lane (#240)
Re: Big 7.1 open items

Tom,

A "container" can be a file or a device or a directory. Here again are examples that I already posted in
another thread:

In the first example there are 3 devices specified as containers. In the second example 3 directories are
specified as containers (DB2 therefore makes its own file names in it - and uses OIDs to do it I think). In
the third example 2 files are the 2 containers. In the fourth example 6 devices on 3 nodes are the
containers.

CREATE TABLESPACE PAYROLL
MANAGED BY DATABASE
USING (DEVICE'/dev/rhdisk6' 10000,
DEVICE '/dev/rhdisk7' 10000,
DEVICE '/dev/rhdisk8' 10000)
OVERHEAD 24.1
TRANSFERRATE 0.9

CREATE TABLESPACE ACCOUNTING
MANAGED BY SYSTEM
USING ('d:\acc_tbsp', 'e:\acc_tbsp', 'f:\acc_tbsp')
EXTENTSIZE 64
PREFETCHSIZE 32

CREATE TEMPORARY TABLESPACE TEMPSPACE2
MANAGED BY DATABASE
USING (FILE '/tmp/tempspace2.f1' 50000,
FILE '/tmp/tempspace2.f2' 50000)
EXTENTSIZE 256

CREATE TABLESPACE PLANS
MANAGED BY DATABASE
USING (DEVICE '/dev/rhdisk0' 10000, DEVICE '/dev/rn1hd01' 40000) ON NODE 1
USING (DEVICE '/dev/rhdisk0' 10000, DEVICE '/dev/rn3hd03' 40000) ON NODE 3
USING (DEVICE '/dev/rhdisk0' 10000, DEVICE '/dev/rn5hd05' 40000) ON NODE 5

On Wed, 21 Jun 2000 23:03:03 -0400, Tom Lane wrote:

"Randall Parker" <randall@nls.net> writes:

DB2 supports an ALTER TABLESPACE command that allows one to add new
containers to an existing tablespace. IMO, that's far more supportive
of 24x7 usage.

Er, what do they mean by "container", and why is it better?

regards, tom lane

#242Don Baccus
dhogaza@pacifier.com
In reply to: Chris Bitmead (#238)
Re: Big 7.1 open items

At 01:43 PM 6/22/00 +1000, Chris Bitmead wrote:

I'm wondering if pg_dump should store the location of the tablespace. If
your machine dies and you get a new machine to re-create the database, you
may not want the tablespace in the same spot. And text-editing a
gigabyte file would be extremely painful.

So you don't dump your create tablespace statements, recognizing that on
a new machine (due to upgrades or crashing) you might assign them to
different directories/mount points/whatever. That's the reason for
wanting to hide physical allocation in tablespaces ... the rest of
your datamodel doesn't need to know.

Or you do dump your tablespaces, and knowing the paths assigned
to various ones set up your new machine accordingly.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.

#243Don Baccus
dhogaza@pacifier.com
In reply to: Bruce Momjian (#239)
Re: Big 7.1 open items

At 12:03 AM 6/22/00 -0400, Bruce Momjian wrote:

If the symlink create fails in CREATE TABLESPACE, it just creates an
ordinary directory.

Silent surprises - the earmark of truly professional software ...

- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.

#244Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#240)
RE: Big 7.1 open items

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

I strongly object to keeping the tablespace OID as the smgr file reference
token, though we have to keep it for another purpose, of course. I've
mentioned many times that tablespace (where to store) info should be
distinguished from *where it is stored* info.

Sure. But this proposal assumes that we're relying on symlinks to
carry the information about physical locations corresponding to
tablespace OIDs. The backend just needs to know enough to access a
relation file at a relative pathname like
tablespaceOID/relationOID
(ignoring version and segment numbers for now). Under the hood,
a symlink for tablespaceOID gets the work done.

I think tablespaceOID is an easy substitute for the purpose.
I don't like to depend on a poor directory tree structure in the dbms
either.

Certainly this is not a perfect mechanism. But it is simple, it
is reliable, it is portable to most of the platforms we care about
(yeah, I know we have a Win port, but you wouldn't ever recommend
someone to run a *serious* database on it would you?), and in general
I think the bang-for-the-buck ratio is enormous. I do not want to
have to deal with explicit tablespace bookkeeping in the backend,
but that seems like what we'd have to do in order to improve on
symlinks.

I've already mentioned this 10 times or so, but unfortunately
I see no one on my side yet.
OK, I've given up the discussion about it. I don't want to waste
my time any more.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#245Philip J. Warner
pjw@rhyme.com.au
In reply to: Tom Lane (#237)
Re: Big 7.1 open items

At 23:27 21/06/00 -0400, Tom Lane wrote:

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Please add my opinion to the list.
Unique-id filename: Hiroshi
(Unique-id is unrelated to OID/relname).

I was trying earlier to convince myself that a single unique-ID value
would be better than OID+version for the smgr interface, because it'd
certainly be easier to pass around. I failed to convince myself though,
and the thing that bothered me was this. Suppose you are trying to
recover a corrupted database manually, and the only information you have
about which table is which is a somewhat out-of-date listing of OIDs
versus table names.

This worries me a little; in the Dec/RDB world it is a very long time since
database backups were done by copying the files. There is a database
backup/restore utility which runs while the database is on-line and makes
sure a valid snapshot is taken. Backing up storage areas (tablespaces)
can be done separately by the same utility, and again, it records enough
information to ensure integrity. Maybe the thing to do is write a pg_backup
utility, which in a first pass could, presumably, be synonymous with pg_dump?

Am I missing something here? Is there a problem with backing up using
'pg_dump | gzip'?

(Maybe it's out of date because you got it from
your last backup tape.) If the files are named OID+version you're not
going to have much trouble seeing which is which, even if some of the
versions are higher than what was on the tape.

Unfortunately here you hit severe RI problems, unless you use a 'proper'
database backup.

But if version-updated
tables are given entirely new unique IDs, you've got no hope at all of
telling which one corresponds to what you had in the listing. Maybe
you can tell by looking through the physical file contents, but
certainly this way is more fragile from the point of view of data
recovery.

In the Dec/RDB world, one has to very occasionally restore from files (this
only happens if multiple prior database backups and after-image journals
are corrupt). In this case, there is a utility for examining and changing
storage area file information. This is probably way over the top for
PostgreSQL.

[Aside: FWIW, the Dec/RDB storage area files are named by DBAs to be
something meaningful to the DBA (eg. EMPLOYEE_ARCHIVE), and can contain one
or more tables etc. The files are never renamed or moved by the database
without an instruction from the DBA. The 'storage manager' manages the
datafiles internally. Usually, tables are allocated in chunks of multiples
of some file-based buffer size, and the file grows as needed. This allows
for disk read-ahead to be useful, while storing multiple tables in one
file. As stated in a previous message, tables can also be split across
storage areas]

Once again, I hope I have not missed a fundamental fact...

----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/

#246Philip J. Warner
pjw@rhyme.com.au
In reply to: Chris Bitmead (#238)
Re: Big 7.1 open items

At 13:43 22/06/00 +1000, Chris Bitmead wrote:

Bruce Momjian wrote:

I'm wondering if pg_dump should store the location of the tablespace. If
your machine dies and you get a new machine to re-create the database, you
may not want the tablespace in the same spot. And text-editing a
gigabyte file would be extremely painful.

This is a very good point; the way Dec/RDB gets around it is to allow the
'pg_restore' command to override storage settings when restoring a backup
file.

----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/

#247Tom Lane
tgl@sss.pgh.pa.us
In reply to: Chris Bitmead (#238)
Re: Big 7.1 open items

Chris Bitmead <chrisb@nimrod.itg.telstra.com.au> writes:

I'm wondering if pg_dump should store the location of the tablespace. If
your machine dies and you get a new machine to re-create the database, you
may not want the tablespace in the same spot. And text-editing a
gigabyte file would be extremely painful.

Might make sense to store the tablespace setup separately from the bulk
of the data, but certainly you want some way to dump that info in a
restorable form.

I've been thinking lately that the pg_dump shove-it-all-in-one-file
approach doesn't scale anyway. We ought to start thinking about ways
to make the standard dump method store schema separately from bulk
data, for example. That's offtopic for this thread but ought to be
on the TODO list someplace...

regards, tom lane

#248Tom Lane
tgl@sss.pgh.pa.us
In reply to: Philip J. Warner (#245)
Re: Big 7.1 open items

"Philip J. Warner" <pjw@rhyme.com.au> writes:

... the thing that bothered me was this. Suppose you are trying to
recover a corrupted database manually, and the only information you have
about which table is which is a somewhat out-of-date listing of OIDs
versus table names.

This worries me a little; in the Dec/RDB world it is a very long time since
database backups were done by copying the files. There is a database
backup/restore utility which runs while the database is on-line and makes
sure a valid snapshot is taken. Backing up storage areas (tablespaces)
can be done separately by the same utility, and again, it records enough
information to ensure integrity. Maybe the thing to do is write a pg_backup
utility, which in a first pass could, presumably, be synonymous with pg_dump?

pg_dump already does the consistent-snapshot trick (it just has to run
inside a single transaction).

Am I missing something here? Is there a problem with backing up using
'pg_dump | gzip'?

None, as long as your ambition extends no further than restoring your
data to where it was at your last pg_dump. I was thinking about the
all-too-common-in-the-real-world scenario where you're hoping to recover
some data more recent than your last backup from the fractured shards
of your database...

regards, tom lane

#249Philip J. Warner
pjw@rhyme.com.au
In reply to: Tom Lane (#248)
Re: Big 7.1 open items

At 03:17 22/06/00 -0400, Tom Lane wrote:

This worries me a little; in the Dec/RDB world it is a very long time since
database backups were done by copying the files. There is a database
backup/restore utility which runs while the database is on-line and makes
sure a valid snapshot is taken. Backing up storage areas (tablespaces)
can be done separately by the same utility, and again, it records enough
information to ensure integrity. Maybe the thing to do is write a pg_backup
utility, which in a first pass could, presumably, be synonymous with

pg_dump?

pg_dump already does the consistent-snapshot trick (it just has to run
inside a single transaction).

Am I missing something here? Is there a problem with backing up using
'pg_dump | gzip'?

None, as long as your ambition extends no further than restoring your
data to where it was at your last pg_dump. I was thinking about the
all-too-common-in-the-real-world scenario where you're hoping to recover
some data more recent than your last backup from the fractured shards
of your database...

pg_dump is a good basis for any pg_backup utility; perhaps, as you indicated
elsewhere, more careful formatting of the dump files would make
table-based restoration possible. In another response, I also suggested
allowing overrides of placement information in a restore operation; the
simplest approach would be an 'ignore-storage-parameters' flag. Does this
sound reasonable? If so, then the discussion of file-ids based on OIDs need
not be too concerned with how db restoration is done.

----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/

#250Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Philip J. Warner (#249)
RE: Big 7.1 open items

-----Original Message-----
From: Peter Eisentraut [mailto:e99re41@DoCS.UU.SE]

My opinion
3) database and tablespace are relatively irrelevant.
I assume PostgreSQL's database would correspond
to the concept of SCHEMA.

A database corresponds to a catalog and a schema corresponds to nothing
yet.

Oh, I see your point. However, I've thought that the current PostgreSQL
database is an incomplete SCHEMA, and I still feel so in reality.
A catalog per database has been nothing but needless for me from
the first.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#251Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#237)
RE: Big 7.1 open items

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Please add my opinion to the list.
Unique-id filename: Hiroshi
(Unique-id is unrelated to OID/relname).

"Unique ID" is more or less equivalent to "OID + version number",
right?

Hmm, no one seems to be on my side at this point either.
OK, I change my mind as follows:

OID except on cygwin, unique-id on cygwin

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#252Giles Lean
giles@nemeton.com.au
In reply to: Mikheev, Vadim (#232)
Re: Big 7.1 open items

I meant something like this - link(table file, tmp2 file); fsync(tmp2 file);
unlink(table file); link(tmp file, table file); fsync(table file);
unlink(tmp file).

I don't see the purpose of the fsync() calls here: link() and unlink()
affect file system metadata, which with normal Unix (but not Linux)
filesystem semantics is written synchronously.

fsync() on a file forces outstanding data to disk; it doesn't affect
the preceding or subsequent link() and unlink() calls unless
McKusick's soft updates are in use.

If the intent is to make sure the files are in particular states
before each of the link() and unlink() calls (i.e. soft updates or
similar functionality are in use) then more than fsync() is required,
since the files can still be updated after the fsync() and before
link() or unlink().

On Linux I understand that a fsync() on a directory will force
metadata updates to that directory to be committed, but that doesn't
seem to be what this code is trying to do either?
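The Linux directory-fsync idiom mentioned above can be sketched as follows (a hedged illustration; on filesystems with synchronous metadata writes this is effectively a no-op):

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Force metadata updates (e.g. the effects of link()/unlink()) in a
 * directory out to disk by fsync()ing the directory itself.
 */
static int fsync_dir(const char *dirpath)
{
    int fd = open(dirpath, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = fsync(fd);     /* commits directory entries, not file data */
    close(fd);
    return rc;
}
```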

Regards,

Giles

#253Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Don Baccus (#242)
Re: Big 7.1 open items

At 01:43 PM 6/22/00 +1000, Chris Bitmead wrote:

I'm wondering if pg_dump should store the location of the tablespace. If
your machine dies and you get a new machine to re-create the database, you
may not want the tablespace in the same spot. And text-editing a
gigabyte file would be extremely painful.

So you don't dump your create tablespace statements, recognizing that on
a new machine (due to upgrades or crashing) you might assign them to
different directories/mount points/whatever. That's the reason for
wanting to hide physical allocation in tablespaces ... the rest of
your datamodel doesn't need to know.

Or you do dump your tablespaces, and knowing the paths assigned
to various ones set up your new machine accordingly.

I imagine we will have a -l flag to pg_dump to dump tablespace
locations. If they exist on the new machine, we use them. If not, we
create just directories with no symlinks.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#254Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#250)
Re: Big 7.1 open items

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

A database corresponds to a catalog and a schema corresponds to nothing
yet.

Oh, I see your point. However, I've thought that the current PostgreSQL
database is an incomplete SCHEMA, and I still feel so in reality.
A catalog per database has been nothing but needless for me from
the first.

It may be needless for you, but not for everybody ;-).

In my mind the point of the "database" concept is to provide a domain
within which custom datatypes and functions are available. Schemas
will control the visibility of tables, but SQL92 hasn't thought about
controlling visibility of datatypes or functions. So I think we will
still want "database" = "span of applicability of system catalogs"
and multiple databases allowed per installation, even though there may
be schemas subdividing the database(s).

regards, tom lane

#255Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#251)
Re: Big 7.1 open items

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

OK, I change my mind as follows:
OID except on cygwin, unique-id on cygwin

We don't really want to do that, do we? That's a huge difference in
behavior to have in just one port --- especially a port that none of
the primary developers use (AFAIK anyway). The cygwin port's normal
state of existence will be "broken", surely, if we go that way.

Besides which, OID alone doesn't give us a possibility of file
versioning, and as I commented to Vadim I think we will want that,
WAL or no WAL. So it seems to me the two viable choices are
unique-id or OID+version-number. Either way, the file-naming behavior
should be the same across all platforms.

regards, tom lane

#256Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Tom Lane (#255)
RE: Big 7.1 open items

I believe that we can avoid versions using WAL...

I don't think so. You're basically saying that
1. create file 'new'
2. delete file 'old'
3. rename 'new' to 'old'
is safe as long as you have a redo log to ensure that the rename
happens even if you crash between steps 2 and 3. But crash is not
the only hazard. What if step 3 just plain fails? Redo won't help.

Ok, ok. Let's use a *unique* file name for each table version.
But after thinking it over, it seems that I agree with Hiroshi about using
*some unique id* for file names instead of oid+version: we could use
just the DB's OID + this unique ID in log records to find the table file -
just 8 bytes.

So, add me to Hiroshi's camp... if Hiroshi is ready to implement the new
file naming -:)

But what about LOCATIONs? I object using environment and think that
locations must be stored in pg_control..?

I don't like environment variables for this either; it's just way too
easy to start the postmaster with wrong environment. It still seems
to me that relying on subdirectory symlinks is a good way to go.

I always thought so.

pg_control is not so good --- if it gets corrupted, how do
you recover?

It's impossible to recover anyway - pg_control keeps the last checkpoint
pointer, required for recovery. That's why Oracle recommends (requires?)
at least two copies of the control file (and of the log too).
But what if the log gets corrupted? Or the file system (lost symlinks etc.)?
One will have to use a backup...

Vadim

#257Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Philip J. Warner (#249)
Re: Big 7.1 open items

pg_dump is a good basis for any pg_backup utility; perhaps, as you indicated
elsewhere, more careful formatting of the dump files would make
table-based restoration possible. In another response, I also suggested
allowing overrides of placement information in a restore operation; the
simplest approach would be an 'ignore-storage-parameters' flag. Does this
sound reasonable? If so, then the discussion of file-ids based on OIDs need
not be too concerned with how db restoration is done.

My idea was to make dumping of tablespace locations/symlinks optional.
By trying to control it on the load end, you have to basically have some
way of telling the backend to ignore the symlinks on load. Right now,
pg_dump just creates SQL commands and COPY commands.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#258The Hermit Hacker
scrappy@hub.org
In reply to: Don Baccus (#242)
Re: Big 7.1 open items

On Wed, 21 Jun 2000, Don Baccus wrote:

At 01:43 PM 6/22/00 +1000, Chris Bitmead wrote:

I'm wondering if pg_dump should store the location of the tablespace. If
your machine dies, you get a new machine to re-create the database, you
may not want the tablespace in the same spot. And text-editing a
gigabyte file would be extremely painful.

So you don't dump your create tablespace statements, recognizing that on
a new machine (due to upgrades or crashing) you might assign them to
different directories/mount points/whatever. That's the reason for
wanting to hide physical allocation in tablespaces ... the rest of
your datamodel doesn't need to know.

Or you do dump your tablespaces, and knowing the paths assigned
to various ones set up your new machine accordingly.

Or, modify pg_dump so that it auto-dumps to two files, one for schema, one
for data. Then it's easier to modify the schema on a large database if
tablespaces change ...

#259Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#254)
Re: Big 7.1 open items

Tom Lane writes:

In my mind the point of the "database" concept is to provide a domain
within which custom datatypes and functions are available.

Quoth SQL99:

"A user-defined type is a schema object"

"An SQL-invoked routine is an element of an SQL-schema"

I have yet to see anything in SQL that's a per-catalog object. Some things
are global, like users, but everything else is per-schema.

The way I see it is that schemas are required to be a logical hierarchy,
whereas implementations may see catalogs as a physical division (as indeed
this implementation does).

So I think we will still want "database" = "span of applicability of
system catalogs"

Yes, because the system catalogs would live in a schema of their own.

--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden

#260Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Peter Eisentraut (#259)
RE: Big 7.1 open items

-----Original Message-----
From: Peter Eisentraut

Tom Lane writes:

In my mind the point of the "database" concept is to provide a domain
within which custom datatypes and functions are available.

AFAIK few users understand it, and many users have wondered
why we can't issue cross-"database" queries.

Quoth SQL99:

"A user-defined type is a schema object"

"An SQL-invoked routine is an element of an SQL-schema"

I have yet to see anything in SQL that's a per-catalog object. Some things
are global, like users, but everything else is per-schema.

So why is a system catalog needed per "database"?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#261Chris Bitmead
chrisb@nimrod.itg.telstra.com.au
In reply to: The Hermit Hacker (#258)
Re: Big 7.1 open items

The Hermit Hacker wrote:

Or, modify pg_dump so that it auto-dumps to two files, one for schema, one
for data. Then it's easier to modify the schema on a large database if
tablespaces change ...

That's a pretty good idea as an option. But I'd say keep the schema
separate from the tablespace locations. And if you're going down that
path why not create a directory automatically and dump each table into a
separate file. On occasion I've had to restore one table by hand-editing
the pg_dump, and that's a real pain.

#262Philip Warner
pjw@rhyme.com.au
In reply to: Chris Bitmead (#261)
Re: Big 7.1 open items

At 09:55 23/06/00 +1000, Chris Bitmead wrote:

The Hermit Hacker wrote:

Or, modify pg_dump so that it auto-dumps to two files, one for schema, one
for data. Then it's easier to modify the schema on a large database if
tablespaces change ...

That's a pretty good idea as an option. But I'd say keep the schema
separate from the tablespace locations. And if you're going down that
path why not create a directory automatically and dump each table into a
separate file. On occasion I've had to restore one table by hand-editing
the pg_dump, and that's a real pain.

Have a look at my message entitled:

Proposal: More flexible backup/restore via pg_dump

It's supposed to address these issues.

----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/

#263Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Mikheev, Vadim (#256)
RE: Big 7.1 open items

-----Original Message-----
From: Mikheev, Vadim [mailto:vmikheev@SECTORBASE.COM]

I believe that we can avoid versions using WAL...

I don't think so. You're basically saying that
1. create file 'new'
2. delete file 'old'
3. rename 'new' to 'old'
is safe as long as you have a redo log to ensure that the rename
happens even if you crash between steps 2 and 3. But crash is not
the only hazard. What if step 3 just plain fails? Redo won't help.

Ok, ok. Let's use a *unique* file name for each table version.
But after thinking it over, it seems that I agree with Hiroshi about using
*some unique id* for file names instead of oid+version: we could use
just the DB's OID + this unique ID in log records to find the table file -
just 8 bytes.

So, add me to Hiroshi's camp... if Hiroshi is ready to implement the new
file naming -:)

I've thought of e.g. a newfileid() like newoid(), using pg_variable.
Any other, smarter ways?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#264Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Hiroshi Inoue (#263)
Re: Big 7.1 open items


Bruce Momjian writes:

Here is the list I have gotten of open 7.1 items:

new location for config files

I'm on that task now, more or less by accident but I might as well get it
done. I'm reorganizing all the file name handling code for pg_hba.conf,
pg_ident.conf, pg_control, etc. so they have consistent accessor
routines. The DataDir global variable will disappear, you'll have to use
GetDataDir().

Can we get agreement to remove our secondary password files, and make
something that makes more sense?

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#265Peter Eisentraut
peter_e@gmx.net
In reply to: Bruce Momjian (#1)
Re: Big 7.1 open items

Bruce Momjian writes:

Here is the list I have gotten of open 7.1 items:

new location for config files

I'm on that task now, more or less by accident but I might as well get it
done. I'm reorganizing all the file name handling code for pg_hba.conf,
pg_ident.conf, pg_control, etc. so they have consistent accessor
routines. The DataDir global variable will disappear, you'll have to use
GetDataDir().

--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden

#266Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Peter Eisentraut (#265)
Re: Big 7.1 open items

Bruce Momjian writes:

Can we get agreement to remove our secondary password files, and make
something that makes more sense?

How about this: Normally secondary password files look like

username:ABS5SGh1EL6bk

We could add the option of making them look like

username:+

which means "look into pg_shadow". That would be fully backward
compatible, allows the use of alter user with password, and avoids
creating any extra system tables (that would need to be dumped to plain
text). And the coding looks very simple.

Yes, perfect. In fact, how about a bare

username

line doing the same? Any username with no colon uses pg_shadow.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#267Peter Eisentraut
peter_e@gmx.net
In reply to: Bruce Momjian (#264)
Re: Big 7.1 open items

Bruce Momjian writes:

Can we get agreement to remove our secondary password files, and make
something that makes more sense?

How about this: Normally secondary password files look like

username:ABS5SGh1EL6bk

We could add the option of making them look like

username:+

which means "look into pg_shadow". That would be fully backward
compatible, allows the use of alter user with password, and avoids
creating any extra system tables (that would need to be dumped to plain
text). And the coding looks very simple.

--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden
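
The lookup rule discussed above can be sketched in shell. This is a
hypothetical illustration only (the function name and the sample crypt
string are made up; it mirrors the proposal, not any actual backend code):

```shell
# Hypothetical classifier for a secondary password file entry, per the
# scheme above: "user:<crypt>" means the password is stored in the file,
# while "user:+" or a bare "user" means fall back to pg_shadow.
lookup_source() {
  case "$1" in
    *:+) echo "pg_shadow" ;;   # explicit '+' marker (Peter's proposal)
    *:*) echo "file" ;;        # crypted password kept in the file itself
    *)   echo "pg_shadow" ;;   # bare username (Bruce's variant)
  esac
}

lookup_source "scott:ABS5SGh1EL6bk"   # prints: file
lookup_source "scott:+"               # prints: pg_shadow
lookup_source "scott"                 # prints: pg_shadow
```

Note that the `*:+` pattern must be tested before `*:*`, since a `+`
entry would otherwise match the generic colon form.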

#268Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Thomas Lockhart (#94)
Re: SQL_TEXT (Re: Re: Big 7.1 open items)

Date says that SQL_TEXT is required to have two things:
1) all characters used in the SQL language itself (which is what I
recalled)
2) Every other character from every character set in the
installation.

Doesn't it say "character repertoire", rather than character set? I
think it would be possible to let our SQL_TEXT support every character
repertoire in the world, if we use Unicode or the Mule internal code for
that.

I think that "character set" and "character repertoire" are synonymous
(at least I am interpreting them that way). SQL99 makes a slight
distinction, in that "repertoire" is a "set" in a specific context of
application.

I'm starting to look at the SQL99 doc. I am going to try to read the doc
as if SQL_TEXT is a placeholder for "any allowed character set", not
"all character sets simultaneously" and see if that works.

Since there is a wide range of encodings to choose from, and since most
character sets cannot be translated to an arbitrary other character set,
having SQL_TEXT usefully require all sets to be present simultaneously
seems a bit of a stretch.

I'm also not going to try to understand the complete doc before having a
trial solution; we can extend/modify/redefine/throw away the trial
solution as we understand the spec better.

While I'm thinking about it: afaict, if we have the ability to load
multiple character sets simultaneously, we will want to have *one* of
those mapped in as the "default character set" for an installation or
database. So we might want to statically link that one in, while the
others get loaded dynamically.

- Thomas

#269Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Thomas Lockhart (#268)
Re: SQL_TEXT (Re: Re: Big 7.1 open items)

I think that "character set" and "character repertoire" are synonymous
(at least I am interpreting them that way).

Is it? I think that a "character set" consists of a "character
repertoire" and a "form of use".

SQL99 makes a slight
distinction, in that "repertoire" is a "set" in a specific context of
application.

I don't understand this, probably due to my English ability. Can you
tell me where I can get the SQL99 online doc so that I could study it
more?

While I'm thinking about it: afaict, if we have the ability to load
multiple character sets simultaneously, we will want to have *one* of
those mapped in as the "default character set" for an installation or
database. So we might want to statically link that one in, while the
others get loaded dynamically.

Right. While I am not sure we could statically link it, there could be
a "default character set" in an installation or database. Also we could
have another "default character set" for NATIONAL CHARACTER. It seems
that those "default character sets" are actually the same according to
the standard?
--
Tatsuo Ishii

#270Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Thomas Lockhart (#166)
Re: SQL_TEXT (Re: Re: Big 7.1 open items)

I think that a "character set" consists of a "character
repertoire" and a "form of use".

SQL99 makes a slight distinction, in that "repertoire" is a "set" in
a specific context of application.

I don't understand this probably due to my English ability.

I'm pretty sure that it is due to convoluted standards ;)

Can you tell me where I can get SQL99 on line doc so that I could
study it more?

From Peter E:

ftp://jerry.ece.umassd.edu/isowg3/x3h2/Standards/ansi-iso-9075-[12345]-1999.txt

(a set of 5 files) which are also available in PDF at the same site.

- Thomas

#271Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Peter Eisentraut (#267)
AW: Big 7.1 open items

In my mind the point of the "database" concept is to provide a domain
within which custom datatypes and functions are available. Schemas
will control the visibility of tables, but SQL92 hasn't thought about
controlling visibility of datatypes or functions. So I think we will
still want "database" = "span of applicability of system catalogs"
and multiple databases allowed per installation, even though there may
be schemas subdividing the database(s).

Yes, and people wanting only one database, as in Oracle, will simply
create one database. The only issue I can think of is that they could have
some "default database" other than the current dbname=username, so
they don't need to worry about it.

Andreas

#272Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Zeugswetter Andreas SB (#271)
AW: Big 7.1 open items

In my mind the point of the "database" concept is to provide a domain
within which custom datatypes and functions are available.

AFAIK few users understand it and many users have wondered
why we couldn't issue cross "database" queries.

Imho the same issue is access to tables on another machine.
If we "fix" that, access to another db on the same instance is just
a variant of the above.

Quoth SQL99:

"A user-defined type is a schema object"

"An SQL-invoked routine is an element of an SQL-schema"

I have yet to see anything in SQL that's a per-catalog object. Some
things are global, like users, but everything else is per-schema.

Yes.

So why is a system catalog needed per "database"?

I like to use different databases on a development machine,
because it makes testing easier. The only thing that
needs to be changed is the connect statement. All other statements,
including schema-qualified tablenames, stay exactly the same for
each developer, even though each has his own database
and his own version of functions.
I have yet to see an installation that doesn't have at least one program
that needs access to more than one schema.

On production machines we (using Informix) use different databases
for different products, because it reduces the possibility of accessing
the wrong tables, since the syntax for accessing tables in other db's
is different (dbname[@instancename]:"owner".tabname in Informix)
The schema does not help us, since most of our programs access
tables from more than one schema.

And again someone wanting Oracle'ish behavior will only create one
database per instance.

Andreas

#273Hiroshi Inoue
Inoue@seiren.co.jp
In reply to: Zeugswetter Andreas SB (#272)
RE: Big 7.1 open items

-----Original Message-----
From: Zeugswetter Andreas SB

In my mind the point of the "database" concept is to

provide a domain

within which custom datatypes and functions are available.

AFAIK few users understand it and many users have wondered
why we couldn't issue cross "database" queries.

Imho the same issue is access to tables on another machine.
If we "fix" that, access to another db on the same instance is just
a variant of the above.

What is the difference between SCHEMA and your "database"?
I myself am confused about them.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#274Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Hiroshi Inoue (#273)
AW: Big 7.1 open items

Hiroshi Inoue [mailto:Inoue@seiren.co.jp] wrote:

In my mind the point of the "database" concept is to provide a domain
within which custom datatypes and functions are available.

AFAIK few users understand it and many users have wondered
why we couldn't issue cross "database" queries.

Imho the same issue is access to tables on another machine.
If we "fix" that, access to another db on the same instance is just
a variant of the above.

What is the difference between SCHEMA and your "database"?
I myself am confused about them.

Think of it as a hierarchy:
instance -> database -> schema -> object

- "instance" corresponds to one postmaster
- "database" as in current implementation
- "schema" name corresponds to the owner of the object, though in
some of the implementations I know a corresponding db or OS user does
not need to exist.
- "object" is one of table, index, function ...

The database is what you connect to in your connect statement;
you then see all schemas inside this database only. Access to another
database would need an explicitly created synonym or different syntax.
The default "schema" name is usually the logged in user name
(although I don't like this approach, I like Informix's approach where
the schema need not be specified if tabname is unique (and tabname
is unique per db unless you specify database mode ansi)).
All other schemas have to be explicitly named ("schemaname".tabname).

Oracle has exactly this layout, only you are restricted to one database
per instance.
(They even have a "create database .." statement, although it is somehow
analogous to our initdb).

Andreas

#275Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Zeugswetter Andreas SB (#274)
AW: Big 7.1 open items

Vadim wrote:

Impossible to recover anyway - pg_control keeps last
checkpoint pointer, required for recovery.

Why not put this info in the tx log itself.

That's why Oracle recommends (requires?) at least
two copies of control file ....

This is one of the most stupid design issues Oracle has.
I suggest you look at the tx log design of Informix.
(No Informix dba fears pulling the power cord on his servers; ask the
same of an Oracle dba: they even fear "shutdown immediate" on a
heavily used db.)

Andreas

#276Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Zeugswetter Andreas SB (#275)
AW: Big 7.1 open items

I wrote:

Vadim wrote:

Impossible to recover anyway - pg_control keeps last
checkpoint pointer, required for recovery.

Why not put this info in the tx log itself.

That's why Oracle recommends (requires?) at least
two copies of control file ....

This is one of the most stupid design issues Oracle has.

The problem is that if you want to switch to a no-fsync environment
(here I also include the tx log), but the possibility of losing a write
is still there, you cannot sync writes to two or more different files.
Only one file, the tx log itself, is allowed to carry last-minute
information.

Thus you need to log changes to pg_control in the tx log as well.

Andreas

#277Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Zeugswetter Andreas SB (#276)
RE: Big 7.1 open items

BTW we are about to take in the tablespace concept. You would
need additional information (the name of the symlink to a directory,
which would be = tablespaceOID) for WAL logging.

Do we need *both* database & tablespace to find table file ?!
Imho, database shouldn't be used...

?

Vadim

#278Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikheev, Vadim (#277)
Re: Big 7.1 open items

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

Do we need *both* database & tablespace to find table file ?!
Imho, database shouldn't be used...

That'd work fine for me, but I think Bruce was arguing for paths that
included the database name. We'd end up with paths that go something
like
..../data/tablespaces/TABLESPACEOID/RELATIONOID
(plus some kind of decoration for segment and version), so you'd have
a hard time telling which files in a tablespace belong to which
database. Doesn't bother me a whole lot, personally --- if one wants
to know that one could just as well assign separate tablespaces to
different databases. They're only subdirectories anyway...

regards, tom lane

#279Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Tom Lane (#278)
RE: Big 7.1 open items

Do we need *both* database & tablespace to find table file ?!
Imho, database shouldn't be used...

That'd work fine for me, but I think Bruce was arguing for paths that
included the database name. We'd end up with paths that go something
like
..../data/tablespaces/TABLESPACEOID/RELATIONOID
(plus some kind of decoration for segment and version), so you'd have
a hard time telling which files in a tablespace belong to which
database. Doesn't bother me a whole lot, personally --- if one wants

We could create /data/databases/DATABASEOID/ and create soft-links to
table-files. This way different tables of the same database could be in
different tablespaces. /data/database path would be used in production
and /data/tablespace path would be used in recovery.

Vadim

#280Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikheev, Vadim (#279)
Re: Big 7.1 open items

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

We could create /data/databases/DATABASEOID/ and create soft-links to
table-files. This way different tables of the same database could be in
different tablespaces. /data/database path would be used in production
and /data/tablespace path would be used in recovery.

Why would you want to do it that way? Having a different access path
for recovery than for normal operation strikes me as just asking for
trouble ;-)

The symlinks wouldn't do any good for what Bruce had in mind anyway
(IIRC, he wanted to get useful per-database numbers from "du").

regards, tom lane

#281Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Tom Lane (#280)
RE: Big 7.1 open items

We could create /data/databases/DATABASEOID/ and create
soft-links to table-files. This way different tables of
the same database could be in different tablespaces.
/data/database path would be used in production
and /data/tablespace path would be used in recovery.

Why would you want to do it that way? Having a different access path
for recovery than for normal operation strikes me as just asking for
trouble ;-)

I just think that *databases* (schemas) must be used for *logical* grouping
of tables, not for *physical* grouping. "Where to store a table" is a
tablespace-related kind of thing!

The symlinks wouldn't do any good for what Bruce had in mind anyway
(IIRC, he wanted to get useful per-database numbers from "du").

Imho, the ability to put different tables/indices (of the same database)
in different tablespaces (disks) is much more useful than the ability to
use du/ls for administration purposes -:)

Also, I think that we *must* go away from OS-driven disk space
allocation anyway. Currently, the way we extend table files breaks the WAL
rule (nothing must go to disk until logged). + we have to move tuples
from the end of a file to the top to shrink a relation - not a perfect way
to reuse empty space. +... +... +...

Vadim

#282Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Mikheev, Vadim (#279)
Re: Big 7.1 open items

Tom Lane wrote:

The symlinks wouldn't do any good for what Bruce had in mind anyway
(IIRC, he wanted to get useful per-database numbers from "du").

Our database design seems to be in the opposite direction
if it is restricted for the convenience of command calls.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#283Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Hiroshi Inoue (#282)
Re: Big 7.1 open items

Tom Lane wrote:

The symlinks wouldn't do any good for what Bruce had in mind anyway
(IIRC, he wanted to get useful per-database numbers from "du").

Our database design seems to be in the opposite direction
if it is restricted for the convenience of command calls.

Well, I don't see any reason not to use tablespace/database rather than
just tablespace. Seems having fewer files in each directory will be a
little faster, and if we can make administration easier, why not?

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#284Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Bruce Momjian (#283)
Re: Big 7.1 open items

Bruce Momjian wrote:

Tom Lane wrote:

The symlinks wouldn't do any good for what Bruce had in mind anyway
(IIRC, he wanted to get useful per-database numbers from "du").

Our database design seems to be in the opposite direction
if it is restricted for the convenience of command calls.

Well, I don't see any reason not to use tablespace/database rather than
just tablespace. Seems having fewer files in each directory will be a
little faster, and if we can make administration easier, why not?

I only objected to emphasizing the advantage of getting useful per-database
numbers from "du". It's just misleading in DBMS design.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#285Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Hiroshi Inoue (#284)
AW: Big 7.1 open items

That'd work fine for me, but I think Bruce was arguing for paths that
included the database name. We'd end up with paths that go something
like
..../data/tablespaces/TABLESPACEOID/RELATIONOID
(plus some kind of decoration for segment and version), so you'd have
a hard time telling which files in a tablespace belong to which
database.

Well, as long as we have the file-per-object layout it probably makes
sense to have "speaking paths", but I see no real problem with:

..../data/tablespacename/dbname/RELATIONOID[.dat|.idx]

RELATIONOID standing for whatever the consensus will be.
I do not really see an argument for using a tablespace OID instead of
its [maybe mangled] name.

Andreas

#286Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB (#285)
Re: AW: Big 7.1 open items

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

I do not really see an argument for using a tablespaceoid instead of
it's [maybe mangled] name.

Eliminating filesystem-based restrictions on names, for one.
For example we'd not have to forbid slashes and (probably) backquotes
in tablespace names if we did this, and we'd not have to worry about
filesystem-induced limits on name lengths. Renaming a tablespace
would also be trivial instead of nigh impossible.

It might be that using tablespace names as directory names is worth
enough from the admin point of view to make the above restrictions
acceptable. But it's a tradeoff, and not one with an obvious choice
IMHO.

regards, tom lane
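
The restriction Tom mentions can be illustrated with a hypothetical
name-mangling helper. Neither the function nor the escaping scheme is
proposed anywhere in this thread; it is purely an illustration of the
tradeoff:

```shell
# If tablespace names were used directly as directory names, characters
# like '/' would have to be forbidden or escaped.  A mangling scheme
# sidesteps the forbidding, but renames still require filesystem ops,
# whereas an OID-named directory makes RENAME a pure catalog update.
mangle() {
  # replace '/' with '%2F' so the result is a single path component
  printf '%s\n' "$1" | sed 's|/|%2F|g'
}

mangle "accounting/2000"    # prints: accounting%2F2000
mangle "plain_name"         # prints: plain_name
```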

#287Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Tom Lane (#286)
RE: Big 7.1 open items

The symlinks wouldn't do any good for what Bruce had in
mind anyway (IIRC, he wanted to get useful per-database
numbers from "du").

Our database design seems to be in the opposite direction
if it is restricted for the convenience of command calls.

Well, I don't see any reason not to use tablespace/database
rather than just tablespace. Seems having fewer files in each directory

Once again - ability to use different tablespaces (disks) for tables/indices
in the same schema. Schemas must not dictate where to store objects <-
bad design.

will be a little faster, and if we can make administration easier,
why not?

Because you won't be able to use du/ls once we implement the new smgr anyway.

And, btw - what are we going to implement tablespaces for? Just to have
fewer files in each dir?!

Vadim

#288Peter Eisentraut
peter_e@gmx.net
In reply to: Mikheev, Vadim (#277)
RE: Big 7.1 open items

Mikheev, Vadim writes:

Do we need *both* database & tablespace to find table file ?!
Imho, database shouldn't be used...

Then the system tables from different databases would collide.

--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden

#289Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Mikheev, Vadim (#287)
Re: Big 7.1 open items

The symlinks wouldn't do any good for what Bruce had in
mind anyway (IIRC, he wanted to get useful per-database
numbers from "du").

Our database design seems to be in the opposite direction
if it is restricted for the convenience of command calls.

Well, I don't see any reason not to use tablespace/database
rather than just tablespace. Seems having fewer files in each directory

Once again - ability to use different tablespaces (disks) for tables/indices
in the same schema. Schemas must not dictate where to store objects <-
bad design.

I am suggesting this symlink:

ln -s data/base/testdb/myspace /var/myspace/testdb

rather than:

ln -s data/base/testdb/myspace /var/myspace

Tablespaces still sit inside database directories; it is just that the
link points to a subdirectory of myspace, rather than myspace itself.

Am I missing something?

will be a little faster, and if we can make administration easier,
why not?

Because you won't be able to use du/ls once we implement the new smgr anyway.

At least du will work. I doubt we will be putting tables from different
databases in the same file.

And, btw - what are we going to implement tablespaces for? Just to have
fewer files in each dir?!

No, I thought it was to split files across drives.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#290Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Bruce Momjian (#289)
RE: Big 7.1 open items

Well, I don't see any reason not to use tablespace/database
rather than just tablespace. Seems having fewer files in
each directory

Once again - ability to use different tablespaces (disks)
for tables/indices in the same schema. Schemas must not dictate
where to store objects <- bad design.

I am suggesting this symlink:

ln -s data/base/testdb/myspace /var/myspace/testdb

rather than:

ln -s data/base/testdb/myspace /var/myspace

Tablespaces still sit inside database directories, it is just that it
points to a subdirectory of myspace, rather than myspace itself.

^^^^^^^^^^^

Didn't you mean

ln -s /var/myspace/testdb data/base/testdb/myspace

?

I thought that you didn't like symlinks from data/base/... This is
how I understood Tom's words:

The symlinks wouldn't do any good for what Bruce had in mind anyway
(IIRC, he wanted to get useful per-database numbers from "du").

Vadim

#291Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Mikheev, Vadim (#290)
RE: Big 7.1 open items

Do we need *both* database & tablespace to find table file ?!
Imho, database shouldn't be used...

Then the system tables from different databases would collide.

Actually, if we're going to use unique-ids for file names
then we have to know how to get system file names anyway.
Hm, OID+VERSION would make our life easier... Hiroshi?

Vadim

#292Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Mikheev, Vadim (#290)
Re: Big 7.1 open items

I am suggesting this symlink:

ln -s data/base/testdb/myspace /var/myspace/testdb

rather than:

ln -s data/base/testdb/myspace /var/myspace

Tablespaces still sit inside database directories, it is just that it
points to a subdirectory of myspace, rather than myspace itself.

^^^^^^^^^^^

Sorry, I should have said a symlink sits in data/base and points to the
tablespace. My issue is having it point to tablespace/database and not
just tablespace/.

Didn't you mean

ln -s /var/myspace/testdb data/base/testdb/myspace

No, sorry for the confusion.

?

I thought that you didn't like symlinks from data/base/... This is
how I understood Tom's words:

The symlinks wouldn't do any good for what Bruce had in mind anyway
(IIRC, he wanted to get useful per-database numbers from "du").

No, I want symlinks in data/base. What I wanted was to have
per-database directories in every tablespace so I can do a 'du' in the
tablespace directory to see how much of each database is in the tablespace.
It also nicely partitions the files, makes the directories smaller, and
prevents any possible file name conflict.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
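
The layout Bruce describes can be mocked up in a scratch directory. All
paths and the relation OID here are illustrative; this only demonstrates
the symlink direction and the per-database subdirectory, not real backend
behavior:

```shell
# Build a throwaway tree: a tablespace with a per-database subdirectory,
# and a symlink under data/base pointing into that subdirectory.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/data/base/testdb"       # the database directory
mkdir -p "$ROOT/var/myspace/testdb"     # per-database dir inside the tablespace
ln -s "$ROOT/var/myspace/testdb" "$ROOT/data/base/testdb/myspace"

# A relation file (named by OID) created through the symlink lands in
# the tablespace's per-database subdirectory.
touch "$ROOT/data/base/testdb/myspace/16384"

# 'du' on the tablespace now reports usage broken down per database.
du -s "$ROOT/var/myspace"/*
```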
#293Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Bruce Momjian (#292)
RE: Big 7.1 open items

Then the system tables from different databases would collide.

Actually, if we're going to use unique-ids for file names
then we have to know how to get system file names anyway.
Hm, OID+VERSION would make our life easier... Hiroshi?

I assume we were going to have a pg_class.relversion to do that, but

^^^^^^^^
PG_CLASS_OID.VERSION_ID...

Just a clarification -:)

that is per-database because pg_class is per-database.

Vadim

#294Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Mikheev, Vadim (#291)
Re: Big 7.1 open items

Do we need *both* database & tablespace to find table file ?!
Imho, database shouldn't be used...

Then the system tables from different databases would collide.

Actually, if we're going to use unique-ids for file names
then we have to know how to get system file names anyway.
Hm, OID+VERSION would make our life easier... Hiroshi?

I assume we were going to have a pg_class.relversion to do that, but
that is per-database because pg_class is per-database.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#295Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Bruce Momjian (#294)
RE: Big 7.1 open items

I actually meant I thought we were going to have a pg_class column
called relversion that held the currently active version for that
relation.

Yes, the file name will be pg_class_oid.version_id.

Is that OK?

We recently discussed pure *unique-id* file names...

Vadim

#296Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Mikheev, Vadim (#293)
Re: Big 7.1 open items

Then the system tables from different databases would collide.

Actually, if we're going to use unique-ids for file names
then we have to know how to get system file names anyway.
Hm, OID+VERSION would make our life easier... Hiroshi?

I assume we were going to have a pg_class.relversion to do that, but

^^^^^^^^
PG_CLASS_OID.VERSION_ID...

Just a clarification -:)

I actually meant I thought we were going to have a pg_class column
called relversion that held the currently active version for that
relation.

Yes, the file name will be pg_class_oid.version_id.

Is that OK?

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#297Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Mikheev, Vadim (#295)
Re: Big 7.1 open items

I actually meant I thought we were going to have a pg_class column
called relversion that held the currently active version for that
relation.

Yes, the file name will be pg_class_oid.version_id.

Is that OK?

We recently discussed pure *unique-id* file names...

Well, that would allow us to mix database files in the same directory,
if we wanted to do that. In my opinion it is better to keep databases in
separate directories in each tablespace for clarity and performance
reasons.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
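
The naming scheme discussed above can be sketched with a hypothetical
helper. The `rel_filename` function and the segment-suffix convention
are assumptions for illustration; only the `OID.VERSION` form is what
is actually being agreed on in this exchange:

```shell
# Compose a relation file name from its pg_class OID and a version id;
# an optional non-zero segment number gets a further suffix (assumed
# convention, mirroring how segmented heap files carry ".N" today).
rel_filename() {
  reloid=$1; version=$2; segment=${3:-0}
  if [ "$segment" -eq 0 ]; then
    echo "${reloid}.${version}"
  else
    echo "${reloid}.${version}.${segment}"
  fi
}

rel_filename 16384 1      # prints: 16384.1
rel_filename 16384 2 3    # prints: 16384.2.3
```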
#298Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#288)
Re: Big 7.1 open items

Peter Eisentraut <peter_e@gmx.net> writes:

Mikheev, Vadim writes:

Do we need *both* database & tablespace to find table file ?!
Imho, database shouldn't be used...

Then the system tables from different databases would collide.

I've been assuming that we would create a separate tablespace for
each database, which would be the location of that database's
system tables. It's probably also the default tablespace for user
tables created in that database, though it wouldn't have to be.

There should also be a known tablespace for the installation-wide tables
(pg_shadow et al).

With this approach tablespace+relation would indeed be a sufficient
identifier. We could even eliminate the knowledge that certain
tables are installation-wide from the bufmgr and below (currently
that knowledge is hardwired in places that I'd rather didn't know
about it...)

regards, tom lane

#299Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#298)
Re: Big 7.1 open items

Peter Eisentraut <peter_e@gmx.net> writes:

Mikheev, Vadim writes:

Do we need *both* database & tablespace to find table file ?!
Imho, database shouldn't be used...

Then the system tables from different databases would collide.

I've been assuming that we would create a separate tablespace for
each database, which would be the location of that database's
system tables. It's probably also the default tablespace for user
tables created in that database, though it wouldn't have to be.

There should also be a known tablespace for the installation-wide tables
(pg_shadow et al).

With this approach tablespace+relation would indeed be a sufficient
identifier. We could even eliminate the knowledge that certain
tables are installation-wide from the bufmgr and below (currently
that knowledge is hardwired in places that I'd rather didn't know
about it...)

Well, if we did that, I can see a good reason to not use per-database
directories in the tablespace.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#300Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#297)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Well, that would allow us to mix database files in the same directory,
if we wanted to do that. My opinion it is better to keep databases in
separate directories in each tablespace for clarity and performance
reasons.

One reason not to do that is that we'd still have to special-case
the system-wide relations. If it's just tablespace and OID in the
path, then the system-wide rels look just the same as any other rel
as far as the low-level stuff is concerned. That would be nice.

My feeling about the "clarity and performance" issue is that if a
dbadmin wants to keep track of database contents separately, he can
put different databases' tables into different tablespaces to start
with. If he puts several tables into one tablespace, he's saying
he doesn't care about distinguishing their space usage. There's
no reason for us to force an additional level of directory lookup
to be done whether the admin wants it or not.

regards, tom lane

#301Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#300)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Well, that would allow us to mix database files in the same directory,
if we wanted to do that. My opinion it is better to keep databases in
separate directories in each tablespace for clarity and performance
reasons.

One reason not to do that is that we'd still have to special-case
the system-wide relations. If it's just tablespace and OID in the
path, then the system-wide rels look just the same as any other rel
as far as the low-level stuff is concerned. That would be nice.

Yes, good point about pg_shadow. They don't have databases. How do we
get multiple pg_class tables in the same directory? Is the
pg_class.relversion file a number like 1,2,3,4, or does it come out of
some global counter like oid? If so, we could put them in the same
directory.

Should we be concerned about performance when 10-20 databases are sitting
in the same directory? I am thinking about open() and other calls that
scan the directory. Certainly shorter file names will help.

My feeling about the "clarity and performance" issue is that if a
dbadmin wants to keep track of database contents separately, he can
put different databases' tables into different tablespaces to start
with. If he puts several tables into one tablespace, he's saying
he doesn't care about distinguishing their space usage. There's
no reason for us to force an additional level of directory lookup
to be done whether the admin wants it or not.

OK.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#302Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#301)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Yes, good point about pg_shadow. They don't have databases. How do we
get multiple pg_class tables in the same directory? Is the
pg_class.relversion file a number like 1,2,3,4, or does it come out of
some global counter like oid. If so, we could put them in the same
directory.

I think we could get away with insisting that each database store its
pg_class and friends in a separate tablespace (physically distinct
directory) from any other database. That gets around the OID conflict.

It's still an open question whether OID+version is better than
unique-ID for naming files that belong to different versions of the
same relation. I can see arguments on both sides.

regards, tom lane

#303Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Bruce Momjian (#301)
Re: Big 7.1 open items

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Yes, good point about pg_shadow. They don't have databases. How do we
get multiple pg_class tables in the same directory? Is the
pg_class.relversion file a number like 1,2,3,4, or does it come out of
some global counter like oid. If so, we could put them in the same
directory.

I think we could get away with insisting that each database store its
pg_class and friends in a separate tablespace (physically distinct
directory) from any other database. That gets around the OID conflict.

It's still an open question whether OID+version is better than
unique-ID for naming files that belong to different versions of the
same relation. I can see arguments on both sides.

I don't stick to unique-ID. My main point has always been the
transactional control of file allocation change.
However *VERSION(_ID)* may be misleading because it couldn't
mean the version of pg_class tuples.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#304Peter Mount
petermount@it.maidstone.gov.uk
In reply to: Hiroshi Inoue (#303)
RE: Big 7.1 open items

Yes, the file name will be pg_class_oid.version_id.

What about segmented files (ie: those over 1Gb)?

--
Peter Mount
Enterprise Support
Maidstone Borough Council
Any views stated are my own, and not those of Maidstone Borough Council

#305Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Peter Mount (#304)
AW: Big 7.1 open items

I am suggesting this symlink:

ln -s data/base/testdb/myspace /var/myspace/testdb

rather than:

ln -s data/base/testdb/myspace /var/myspace

I guess on similar reasoning I would suggest inserting the extent
subdirectory, because it would be easier to create different
filesystems for them.

ln -s data/base/testdb/myspace/extent1 /var/myspace/extent1/testdb

Andreas

#306Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Zeugswetter Andreas SB (#305)
AW: Big 7.1 open items

I've been assuming that we would create a separate tablespace for
each database, which would be the location of that database's
system tables. It's probably also the default tablespace for user
tables created in that database, though it wouldn't have to be.

I think I would prefer the ability to place more than one database into
the same tablespace.

Andreas

#307Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Zeugswetter Andreas SB (#306)
AW: Big 7.1 open items

I am suggesting this symlink:

ln -s data/base/testdb/myspace /var/myspace/testdb

rather than:

ln -s data/base/testdb/myspace /var/myspace

I guess on similar reasoning I would suggest inserting the extent
subdirectory, because it would be easier to create different
filesystems for them.

ln -s data/base/testdb/myspace/extent1 /var/myspace/extent1/testdb

Grmpf, I meant:
ln -s /var/myspace/extent1/testdb data/base/testdb/myspace/extent1

Andreas

#308Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Zeugswetter Andreas SB (#307)
AW: Big 7.1 open items

The symlinks wouldn't do any good for what Bruce had in
mind anyway (IIRC, he wanted to get useful per-database
numbers from "du").

Our database design seems to be in the opposite direction
if it is restricted for the convenience of command calls.

Well, I don't see any reason not to use tablespace/database
rather than just tablespace. Seems having fewer files in

each directory

Once again - ability to use different tablespaces (disks) for
tables/indices
in the same schema. Schemas must not dictate where to store objects <-
bad design.

Can we agree, that the schema is below the database hierarchy, and thus
is something different than a database ?
A table under another schema will simply get another oid, and thus no
collision.
But I agree that schema should not dictate storage location,
but the schema might imply a default storage location like in Oracle
(default tablespaces for a user).

will be a little faster, and if we can make administration easier,
why not?

Because you'll not be able to use du/ls once we'll implement new
smgr anyway.

Leaving the file-per-table design imho implies an order-of-magnitude
increase in the impact of errors in the smgr. Now an error is likely to
destroy only one table; then it could destroy a whole tablespace.
But I am still a fan of the single file/raw device per tablespace design,
since it can remove a lot of the OS overhead.

And, btw, - for what are we going implement tablespaces? Just to have
fewer files in each dir ?!

No, I guess the idea is to have a tool to manipulate physical distribution
of objects (which disk, which filesystem ...)

Andreas

#309Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Zeugswetter Andreas SB (#308)
Re: AW: Big 7.1 open items

Zeugswetter Andreas SB wrote:

The symlinks wouldn't do any good for what Bruce had in
mind anyway (IIRC, he wanted to get useful per-database
numbers from "du").

Our database design seems to be in the opposite direction
if it is restricted for the convenience of command calls.

Well, I don't see any reason not to use tablespace/database
rather than just tablespace. Seems having fewer files in

each directory

Once again - ability to use different tablespaces (disks) for
tables/indices
in the same schema. Schemas must not dictate where to store objects <-
bad design.

Can we agree, that the schema is below the database hierarchy, and thus
is something different than a database ?

I don't think we have a common understanding for PG's *database*
(created by createdb). Everyone seems to have his own *database*.

According to your another posting,your *database* hierarchy is
instance -> database -> schema -> object
like Oracle.

However SQL92 seems to have another hierarchy:
cluster -> catalog -> schema -> object
and dot notation catalog.schema.object could be used.

I couldn't find a clear correspondence between PG's *database*
and above hierarchy because we have no dot notation for
objects currently.

A table under another schema will simply get another oid, and thus no
collision.
But I agree that schema should not dictate storage location,
but the schema might imply a default storage location like in Oracle
(default tablespaces for a user).

AFAIK,schema is independent from user in SQL92.
So default_tablespace_per_user doesn't necessarily imply
default_tablespace_per_schema.

Regards.

Hiroshi Inoue

#310Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Hiroshi Inoue (#309)
AW: AW: Big 7.1 open items

AFAIK,schema is independent from user in SQL92.
So default_tablespace_per_user doesn't necessarily imply
default_tablespace_per_schema.

Well, somebody must be interpreting this wrong, because
in Informix and Oracle the schema corresponds to the owner
and they say they conform to ansi in this regard.
In both, a user can access other schemas or switch to another
schema, so in that sense you could say that the schema is
independent of users.

However SQL92 seems to have another hierarchy:
cluster -> catalog -> schema -> object

I would say our "database" corresponds to "catalog" and
"instance" corresponds to "cluster" in the SQL92 hierarchy.
Instance is probably a bad wording in respect to multiple
machine clusters where you can access all objects
from every node.
Database was probably not used, because this is often used
to describe the whole hierarchy.

I couldn't find a clear correspondence between PG's *database*
and above hierarchy because we have no dot notation for
objects currently.

This will definitely be a problem because of our current nested dot
interpretation towards functions taking one opaque or _class_type
argument.

Andreas

#311Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Zeugswetter Andreas SB (#305)
Re: AW: Big 7.1 open items

I am suggesting this symlink:

ln -s data/base/testdb/myspace /var/myspace/testdb

rather than:

ln -s data/base/testdb/myspace /var/myspace

I guess on similar reasoning I would suggest inserting the extent
subdirectory, because it would be easier to create different
filesystems for them.

ln -s data/base/testdb/myspace/extent1 /var/myspace/extent1/testdb

The idea was to put the main files in the directory, and create Extent2,
Extent3 directories for the extents.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#312Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB (#306)
Re: AW: Big 7.1 open items

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

I've been assuming that we would create a separate tablespace for
each database, which would be the location of that database's
system tables. It's probably also the default tablespace for user
tables created in that database, though it wouldn't have to be.

I think I would prefer the ability to place more than one database into
the same tablespace.

You can put user tables from multiple databases into the same
tablespace, under this proposal. Just not system tables.

regards, tom lane

#313Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB (#307)
Re: AW: Big 7.1 open items

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

I guess on similar reasoning I would suggest inserting the extent
subdirectory, because it would be easier to create different
filesystems for them.

ln -s data/base/testdb/myspace/extent1 /var/myspace/extent1/testdb

Grmpf, I meant:
ln -s /var/myspace/extent1/testdb data/base/testdb/myspace/extent1

That would mean more bookkeeping: every time you add an extent to a
tablespace, you'd have to go around and find all the referencing
databases and add a symlink to each one.

But I think the direction we're headed in is that the data/base/DBNAME
directories are going to disappear entirely, so this argument about
what symlinks they need to have is a bit pointless ;-). Databases
are going to become a higher-level concept that's not directly reflected
in the physical layout.

The way I'm currently envisioning it is that we have paths like

data/spaces/TABLESPACE/EXTENT/RELATION.VERSION

(ignoring the details about whether we use names or OIDs and which
directory levels might be symlinks). Since we will require each logical
database to have a distinct "home tablespace" in which its system tables
live, that "home tablespace" can be the runtime working directory for
backends running in that database. If you like you can think of the
home tablespace directory as being equivalent to the old database
directory, but it's really a different notion --- and in particular,
it's got nothing to do with how the backend addresses tables that are
in other tablespaces.
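That scheme can be sketched with plain directories and a symlink. All
numbers and paths below are invented for illustration (tablespace 16384,
relation 18427, version 1); nothing here is actual PostgreSQL behavior:

```shell
set -e
BASE=/tmp/ts_demo
rm -rf "$BASE"
# tablespace 16384: extent 1 lives locally, extent 2 on another "filesystem"
mkdir -p "$BASE/data/spaces/16384/1" "$BASE/otherdisk/16384/2"
ln -s "$BASE/otherdisk/16384/2" "$BASE/data/spaces/16384/2"
# relation 18427, version 1: one segment per extent
touch "$BASE/data/spaces/16384/1/18427.1"
touch "$BASE/data/spaces/16384/2/18427.1"   # transparently lands on the other disk
ls "$BASE/data/spaces/16384"
```

Note that a `du` on the tablespace directory then reports per-tablespace
usage directly, without any per-database directory level.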

BTW, it occurs to me that we ought to have some frammish whereby temp
files and tables created by backends running in a particular database
can be directed to a different tablespace. If we do nothing, then
they'd always appear in the database's home tablespace, but I can sure
see a dbadmin wanting to push his large sort temp files off to someplace
else...

regards, tom lane

#314Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#313)
Re: AW: Big 7.1 open items

BTW, it occurs to me that we ought to have some frammish whereby temp
files and tables created by backends running in a particular database
can be directed to a different tablespace. If we do nothing, then
they'd always appear in the database's home tablespace, but I can sure
see a dbadmin wanting to push his large sort temp files off to someplace
else...

Sort tables are an area where Ingres does round-robin, so the tape files
can be on different drives.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#315Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Mount (#304)
Re: Big 7.1 open items

Peter Mount <petermount@it.maidstone.gov.uk> writes:

Yes, the file name will be pg_class_oid.version_id.

What about segmented files (ie: those over 1Gb)?

Separate issue. Putting the segment number into the filename is
a bad idea because it doesn't give you any way to spread multiple
segments of a big table across filesystems. What's currently being
discussed is paths that look like

something/SEGNO/RELATIONOID.VERSIONID

This lets you control space allocation by making the SEGNO
subdirectories be symlinks to various places.

regards, tom lane

#316Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#315)
Re: Big 7.1 open items

Bruce Momjian <pgman@candle.pha.pa.us> writes:

If we put multiple database tables in the same directory, have we
considered how to drop databases? Right now we do rm -rf:

rm -rf will no longer work in a tablespaces environment anyway.
(Even if you kept symlinks underneath the DB directory, rm -rf
wouldn't follow them.)

DROP DATABASE will have to be implemented honestly: run through
pg_class and do a regular DROP on each user table.
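A minimal pseudo-SQL sketch of that loop (the relkind test and the
`pg_` prefix check are assumptions about how user tables would be told
apart; how the scan reaches the target database's pg_class is the open
question):

```sql
-- Hypothetical sketch only: an "honest" DROP DATABASE.
SELECT relname FROM pg_class
 WHERE relkind = 'r'           -- ordinary relations
   AND relname !~ '^pg_';      -- skip system tables
-- for each relname returned: DROP TABLE <relname>;
-- finally, remove the now system-tables-only home tablespace directory.
```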

Once you've got rid of the user tables, rm -rf should suffice to
get rid of the "home tablespace" as I've been calling it, with
all the system tables therein.

Now that you mention it, this is another reason why system tables for
each database have to live in a separate tablespace directory: there's
no other good way to do that final stage of DROP DATABASE. The
DROP-each-table approach doesn't work for system tables (somewhere along
about the point where you drop pg_attribute, DROP TABLE itself would
stop working ;-)).

However I do see a bit of a problem here: since DROP DATABASE is
ordinarily executed by a backend that's running in a different database,
how's it going to read pg_class of the target database? Perhaps it will
be necessary to fire up a sub-backend that runs in the target DB for
long enough to kill all the user tables. Looking messy...

regards, tom lane

#317Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#316)
Re: Big 7.1 open items

However I do see a bit of a problem here: since DROP DATABASE is
ordinarily executed by a backend that's running in a different database,
how's it going to read pg_class of the target database? Perhaps it will
be necessary to fire up a sub-backend that runs in the target DB for
long enough to kill all the user tables. Looking messy...

That was my feeling. Imagine another issue. If you see a file, how do
you figure out what database it belongs to? You would have to cycle
through the pg_class relations for every database. Seems such reverse
lookups would not be impossible. Not sure if it will ever be required.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#318Ross J. Reedstrom
reedstrm@rice.edu
In reply to: Zeugswetter Andreas SB (#310)
Re: AW: Big 7.1 open items

On Wed, Jun 28, 2000 at 02:07:33PM +0200, Zeugswetter Andreas SB wrote:

AFAIK,schema is independent from user in SQL92.
So default_tablespace_per_user doesn't necessarily imply
default_tablespace_per_schema.

Well, somebody must be interpreting this wrong, because
in Informix and Oracle the schema corresponds to the owner
and they say they conform to ansi in this regard.

To quote from the SQL92 standard for CREATE SCHEMA:

<schema definition> ::=
CREATE SCHEMA <schema name clause>
[ <schema character set specification> ]
[ <schema element>... ]

<schema name clause> ::=
<schema name>
| AUTHORIZATION <schema authorization identifier>
| <schema name> AUTHORIZATION <schema authorization identifier>

1) If <schema name> is not specified, then a <schema name> equal to
<schema authorization identifier> is implicit.

2) If AUTHORIZATION <schema authorization identifier> is not specified, then

Case:

a) If the <schema definition> is contained in a <module> that
has a <module authorization identifier> specified, then an
<authorization identifier> equal to that <module authorization
identifier> is implicit for the <schema definition>.

b) Otherwise, an <authorization identifier> equal to the SQL-
session <authorization identifier> is implicit.

So, we see that the SQL92 default for a schema is the session username.
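Spelled out as statements (SQL92 syntax for illustration; PostgreSQL had
no CREATE SCHEMA at this point, so the names here are invented and these
are not working commands):

```sql
CREATE SCHEMA accounting AUTHORIZATION user1;  -- explicit name and owner
CREATE SCHEMA AUTHORIZATION user1;             -- name defaults to "user1"
CREATE SCHEMA accounting;                      -- owner defaults to the session user
```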

In both, a user can access other schemas or switch to another
schema, so in that sense you could say that the schema is
independent of users.

Not only in a sense, it is in fact.

This will definitely be a problem because of our current nested dot
interpretation towards functions taking one opaque or _class_type
argument.

Right. If we're going to support SQL92 dot notation (which I think we
should) we'll either need to lose the function notion completely, or
come up with some really clever hack about applying them in order.

Ross
--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005

#319Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Zeugswetter Andreas SB (#310)
RE: AW: Big 7.1 open items

-----Original Message-----
From: Zeugswetter Andreas SB

AFAIK,schema is independent from user in SQL92.
So default_tablespace_per_user doesn't necessarily imply
default_tablespace_per_schema.

Well, somebody must be interpreting this wrong, because
in Informix and Oracle the schema corresponds to the owner
and they say they conform to ansi in this regard.

Is there really a schema:user=1:1 limitation in SQL-92 ?
Though both SQL-86 and SQL-89 had the limitation,
SQL-92 removed it AFAIK.

Regards.

Hiroshi Inoue

#320Ross J. Reedstrom
reedstrm@rice.edu
In reply to: Hiroshi Inoue (#319)
Re: AW: Big 7.1 open items

On Thu, Jun 29, 2000 at 02:05:31AM +0900, Hiroshi Inoue wrote:

Is there really a schema:user=1:1 limitation in SQL-92 ?
Though both SQL-86 and SQL-89 had the limitation
SQL-92 removed it AFAIK.

See my other post. In SQL92, the username is the default schema name.

Ross
--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005

#321Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#316)
RE: Big 7.1 open items

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

However I do see a bit of a problem here: since DROP DATABASE is
ordinarily executed by a backend that's running in a different database,
how's it going to read pg_class of the target database? Perhaps it will
be necessary to fire up a sub-backend that runs in the target DB for
long enough to kill all the user tables. Looking messy...

Why do we have to have system tables per *database* ?
Is there anything wrong with global system tables ?
And how about adding dbid to pg_class,pg_proc etc ?

Regards.

Hiroshi Inoue

#322Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#321)
Re: Big 7.1 open items

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Why do we have to have system tables per *database* ?
Is there anything wrong with global system tables ?
And how about adding dbid to pg_class,pg_proc etc ?

We could, but I think I'd vote against it on two grounds:

1. Reliability. If something corrupts pg_class, do you want to
lose your whole installation, or just one database?

2. Increased locking overhead/loss of concurrency. Currently, there
is very little lock contention between backends running in different
databases. A shared pg_class will be a single point of locking (as
well as a single point of failure) for the whole installation.

It would solve the DROP DATABASE problem kind of nicely, but really
it'd just be downgrading DROP DATABASE to a DROP SCHEMA operation...

regards, tom lane

#323Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#298)
Re: Big 7.1 open items

Tom Lane writes:

I've been assuming that we would create a separate tablespace for
each database, which would be the location of that database's
system tables.

Then I can't put more than one database into a table space? But I can put
more than one table space into a database? I think that's the wrong
hierarchy. More specifically, I think it's wrong that there is a hierarchy
here at all. Table spaces and databases don't have to know about each
other in any predefined way.

--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden

#324Peter Eisentraut
peter_e@gmx.net
In reply to: Bruce Momjian (#301)
Re: Big 7.1 open items

Bruce Momjian writes:

How do we get multiple pg_class tables in the same directory?

pg_class.DBOID

--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden

#325Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Hiroshi Inoue (#321)
Re: Big 7.1 open items

Tom Lane wrote:

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Why do we have to have system tables per *database* ?
Is there anything wrong with global system tables ?
And how about adding dbid to pg_class,pg_proc etc ?

We could, but I think I'd vote against it on two grounds:

1. Reliability. If something corrupts pg_class, do you want to
lose your whole installation, or just one database?

2. Increased locking overhead/loss of concurrency. Currently, there
is very little lock contention between backends running in different
databases. A shared pg_class will be a single point of locking (as
well as a single point of failure) for the whole installation.

Isn't the current design of PG's *database* for dropdb using "rm -rf"
rather than for 1 and 2 above?
If we couldn't rely on our db itself and our locking mechanism is
poor, we could start different postmasters for different *database*s.

It would solve the DROP DATABASE problem kind of nicely, but really
it'd just be downgrading DROP DATABASE to a DROP SCHEMA operation...

What is our *DATABASE* ?
Is it clear to all people ?
At least it's a vague concept for me.
Could you please tell me what kind of objects are our *DATABASE*
objects but could not be schema objects ?

Regards.

Hiroshi Inoue

#326Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#323)
Re: Big 7.1 open items

Peter Eisentraut <peter_e@gmx.net> writes:

Tom Lane writes:

I've been assuming that we would create a separate tablespace for
each database, which would be the location of that database's
system tables.

Then I can't put more than one database into a table space? But I can put
more than one table space into a database?

You can put *user* tables from more than one database into a table space.
The restriction is just on *system* tables.

Admittedly this is a tradeoff. We could avoid it along the lines you
suggest (name table files like DBOID.RELOID.VERSION instead of just
RELOID.VERSION) but is it really worth it? Vadim's concerned about
every byte that has to go into the WAL log, and I think he's got a
good point.

I think that's the wrong
hierarchy. More specifically, I think it's wrong that there is a hierarchy
here at all. Table spaces and databases don't have to know about each
other in any predefined way.

They don't, at least not at the smgr level. In my view of how this
should work, the smgr *only* knows about tablespaces and tables.
Databases are a higher-level construct.

regards, tom lane

#327Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Tom Lane (#326)
AW: AW: Big 7.1 open items

ln -s data/base/testdb/myspace/extent1 /var/myspace/extent1/testdb

The idea was to put the main files in the directory, and create Extent2,
Extent3 directories for the extents.

The reasoning was that the database subdir should be below the extent dir,
so that creating a different fs for each extent would be easier and would
not depend on the database name.

It is easy to create fs for:
/var/myspace
or
/var/myspace[/extent1]
/var/myspace/extent2
but not if it has dbname in it.

Andreas

#328Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Zeugswetter Andreas SB (#327)
AW: AW: Big 7.1 open items

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

I've been assuming that we would create a separate tablespace for
each database, which would be the location of that database's
system tables. It's probably also the default

tablespace for user

tables created in that database, though it wouldn't have to be.

I think I would prefer the ability to place more than one

database into

the same tablespace.

You can put user tables from multiple databases into the same
tablespace, under this proposal. Just not system tables.

Yes, but then it is only half baked.

Andreas

#329Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Zeugswetter Andreas SB (#328)
AW: AW: Big 7.1 open items

AFAIK,schema is independent from user in SQL92.
So default_tablespace_per_user doesn't necessarily imply
default_tablespace_per_schema.

Well, somebody must be interpreting this wrong, because
in Informix and Oracle the schema corresponds to the owner
and they say they conform to ansi in this regard.

Is there really a schema:user=1:1 limitation in SQL-92 ?
Though both SQL-86 and SQL-89 had the limitation
SQL-92 removed it AFAIK.

As I said in another posting a user does not need to exist
for each schema. The dba can create objects under any
schema name.

Andreas

#330Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Zeugswetter Andreas SB (#329)
RE: AW: Big 7.1 open items

-----Original Message-----
From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On
Behalf Of Zeugswetter Andreas SB

AFAIK,schema is independent from user in SQL92.
So default_tablespace_per_user doesn't necessarily imply
default_tablespace_per_schema.

Well, somebody must be interpreting this wrong, because
in Informix and Oracle the schema corresponds to the owner
and they say they conform to ansi in this regard.

Is there really a schema:user=1:1 limitation in SQL-92 ?
Though both SQL-86 and SQL-89 had the limitation
SQL-92 removed it AFAIK.

As I said in another posting a user does not need to exist
for each schema. The dba can create objects under any
schema name.

Sorry for my poor understanding.
What I meant was that SQL92 allows the following.

schema owner
---------------------------
schema1 user1
schema2 user1
schema3 user2
schema4 user3
schema5 user3
schema6 user3

Is my understanding the same as yours?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#331Peter Mount
petermount@it.maidstone.gov.uk
In reply to: Hiroshi Inoue (#330)
RE: AW: Big 7.1 open items

The SQL7 way is that the schema is the username, with the exception of
"dba", which is used as a "global" schema.

--
Peter Mount
Enterprise Support
Maidstone Borough Council
Any views stated are my own, and not those of Maidstone Borough Council

-----Original Message-----
From: Hiroshi Inoue [mailto:Inoue@tpf.co.jp]
Sent: Thursday, June 29, 2000 9:22 AM
To: Zeugswetter Andreas SB
Cc: PostgreSQL-development
Subject: RE: AW: [HACKERS] Big 7.1 open items

-----Original Message-----
From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On
Behalf Of Zeugswetter Andreas SB

AFAIK,schema is independent from user in SQL92.
So default_tablespace_per_user doesn't necessarily imply
default_tablespace_per_schema.

Well, somebody must be interpreting this wrong, because
in Informix and Oracle the schema corresponds to the owner
and they say they conform to ansi in this regard.

Is there really a schema:user=1:1 limitation in SQL-92 ?
Though both SQL-86 and SQL-89 had the limitation
SQL-92 removed it AFAIK.

As I said in another posting a user does not need to exist
for each schema. The dba can create objects under any
schema name.

Sorry for my poor understanding.
What I meant was that SQL92 allows the following.

schema owner
---------------------------
schema1 user1
schema2 user1
schema3 user2
schema4 user3
schema5 user3
schema6 user3

Is my understaning same as yours ?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#332Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Peter Mount (#331)
AW: AW: Big 7.1 open items

As I said in another posting a user does not need to exist
for each schema. The dba can create objects under any
schema name.

Sorry for my poor understanding.
What I meant was that SQL92 allows the following.

schema owner
---------------------------
schema1 user1
schema2 user1
schema3 user2
schema4 user3
schema5 user3
schema6 user3

Is my understanding the same as yours?

Yes, this is how I read the SQL99 spec. Also:
schema1 user1
schema1 user2

I doubt that this really buys any features that a simple grant cannot give.
I mean, if a user creates an object with a schema name that is different
from his user name we could simply grant him all rights on this object
(if he isn't dba).
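
For illustration, with made-up names (a table "foo" created by "user1"
under a schema name "appschema" that differs from his user name), the
implicit grant could amount to something like:

    -- sketch only; "appschema", "foo", and "user1" are hypothetical names
    CREATE TABLE appschema.foo (id integer);
    GRANT ALL ON appschema.foo TO user1;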

Andreas

#333Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Zeugswetter Andreas SB (#332)
AW: AW: Big 7.1 open items

I think I would prefer the ability to place more than one
database into the same tablespace.

You can put user tables from multiple databases into the same
tablespace, under this proposal. Just not system tables.

Yes, but then it is only half baked.

Half baked or not, I think I am starting to like it.
I think I would restrict such an automagically created tablespace
(tblspace name = db name) to only contain tables from this database.

Andreas

#334Peter Eisentraut
peter_e@gmx.net
In reply to: Hiroshi Inoue (#309)
Re: AW: Big 7.1 open items

Hiroshi Inoue writes:

According to another posting of yours, your *database* hierarchy is
instance -> database -> schema -> object
like Oracle.

However SQL92 seems to have another hierarchy:
cluster -> catalog -> schema -> object
and the dot notation catalog.schema.object can be used.

FYI:

An "instance" is a "cluster". I don't know where the word instance came
from, the docs sometimes call it "installation" or "site", which is even
worse. I have been using "database cluster" for the latest documentation
work. My dictionary defines a cluster as "a group of things gathered or
occurring closely together", which is what this is. Call it a "data area"
or an "initdb'ed thing", etc.

A "catalog" can be equated with our "database". The method of creating
catalogs is implementation defined, so our CREATE DATABASE command is in
perfect compliance with the standard. We don't support the
catalog.schema.object notation but that notation only makes sense when you
can access more than one catalog at a time. We don't allow that and SQL
doesn't require it. We could allow that notation and throw an error when
the catalog name doesn't match the current database, but that's mere
cosmetic work.

In entry level SQL 92, a "schema" is essentially the same as table
ownership. You can execute the command CREATE SCHEMA AUTHORIZATION
"peter", which means that user "peter" (where he came from is
"implementation-defined") can now create tables under his name. There is
no such thing as a table owner, there's the "containing schema" and its
owner. The tables "peter" creates can then be referenced by the dotted
notation. But it is not correct to equate this with CREATE USER. Even if
there was no schema for "peter" he could still connect and query other
people's tables.

Moving beyond SQL 92 you can also create schemas with a different name
than your user name. This is merely a little more naming flexibility.
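
For illustration, the entry-level SQL92 flow could be sketched like
this (the table and columns are made-up examples):

    CREATE SCHEMA AUTHORIZATION "peter";
    -- "peter" may now create tables under his name, and others can
    -- reference them with the dotted notation:
    CREATE TABLE "peter".addresses (name text, city text);
    SELECT name FROM "peter".addresses;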

--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden

#335Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#326)
Re: Big 7.1 open items

Tom Lane writes:

You can put *user* tables from more than one database into a table space.
The restriction is just on *system* tables.

I think my understanding as a user would be that a table space represents
a storage location. If I want to put a table/object/entire database on a
fancy disk somewhere I create a table space for it there. But if I want to
store all my stuff under /usr/local/pgsql/data then I wouldn't expect to
have to create more than one table space. So at that point the table
spaces become coupled to the logical hierarchy: I must make sure to
create enough table spaces just to have many databases.

More specifically, what would the user interface to this look like?
Clearly there has to be some sort of CREATE TABLESPACE command. Now does
CREATE DATABASE imply a CREATE TABLESPACE? I think not. Do you have to
create a table space before creating each database? I think not.

We could avoid it along the lines you suggest (name table files like
DBOID.RELOID.VERSION instead of just RELOID.VERSION) but is it really
worth it?

I only intended that for pg_class and other bootstrap-sort-of tables,
maybe all system tables. Normal heap files could look like RELOID.VERSION,
whereas system tables would look like "name.DBOID". Clearly there's no
market for renaming system tables or dropping any of their columns. We're
obviously going to have to treat pg_class special anyway.

Vadim's concerned about every byte that has to go into the WAL log,
and I think he's got a good point.

True. But if you only do it for the system tables then it might take less
space than keeping track of lots of table spaces that are unneeded. :-)

--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden

#336Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Peter Eisentraut (#334)
Re: AW: Big 7.1 open items

Peter Eisentraut wrote:

Hiroshi Inoue writes:

According to another posting of yours, your *database* hierarchy is
instance -> database -> schema -> object
like Oracle.

However SQL92 seems to have another hierarchy:
cluster -> catalog -> schema -> object
and the dot notation catalog.schema.object can be used.

FYI:

Thanks.
I'm asking everyone what our *DATABASE* is.
Unlike you, I couldn't see any decisive feature in our *DATABASE*.

An "instance" is a "cluster". I don't know where the word instance came

I could find the word in Oracle.
IMHO, it corresponds to our initdb'ed thing (which a postmaster controls).

from, the docs sometimes call it "installation" or "site", which is even
worse. I have been using "database cluster" for the latest documentation
work. My dictionary defines a cluster as "a group of things gathered or
occurring closely together", which is what this is. Call it a "data area"
or an "initdb'ed thing", etc.

SQL92 seems to say that a cluster corresponds to a target of connection
and has no name (after the connection was established). Isn't it the
same as our *DATABASE*?

A "catalog" can be equated with our "database". The method of creating
catalogs is implementation defined, so our CREATE DATABASE command is in
perfect compliance with the standard. We don't support the
catalog.schema.object notation but that notation only makes sense when you
can access more than one catalog at a time.

Yes, it's most essential that we can't access more than one catalog.
This means that we have only one (noname) "catalog" per "cluster".

We don't allow that and SQL
doesn't require it. We could allow that notation and throw an error when
the catalog name doesn't match the current database, but that's mere
cosmetic work.

In entry level SQL 92, a "schema" is essentially the same as table
ownership. You can execute the command CREATE SCHEMA AUTHORIZATION
"peter", which means that user "peter" (where he came from is
"implementation-defined") can now create tables under his name. There is
no such thing as a table owner, there's the "containing schema" and its
owner. The tables "peter" creates can then be referenced by the dotted
notation. But it is not correct to equate this with CREATE USER. Even if
there was no schema for "peter" he could still connect and query other
people's tables.

I've used *username* "schema"s in Oracle for a long time, but I've never
thought that it's the essence of "schema". If I understand correctly, the
concept of "catalog" hasn't necessarily been important while "schema"
= "user": a conflict of "schema" names is equivalent to a conflict of
"user" names in that case. IMHO, SQL92 has required the
concept of "catalog" because "schema" has been changed to be
independent of "user".

Anyway, in current PG "cluster":"catalog":"schema" = 1:1:1(0), and
our *DATABASE* is the only confusing concept in the hierarchy.

Regards,

Hiroshi Inoue

#337Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#335)
Re: Big 7.1 open items

Peter Eisentraut <peter_e@gmx.net> writes:

Tom Lane writes:

You can put *user* tables from more than one database into a table space.
The restriction is just on *system* tables.

More specifically, what would the user interface to this look like?
Clearly there has to be some sort of CREATE TABLESPACE command. Now does
CREATE DATABASE imply a CREATE TABLESPACE? I think not. Do you have to
create a table space before creating each database? I think not.

I would say that CREATE DATABASE just implicitly creates a new
tablespace that's physically located right under the toplevel data
directory of the installation, no symlink. What's wrong with that?
You need not keep anything except the system tables of the DB there
if you don't want to. In practice, for someone who doesn't need to
worry about tablespaces (because they put the installation on a disk
with enough room for their purposes), the whole thing acts exactly
the same as it does now.
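
One possible reading of that, as a directory sketch (all names are
illustrative, not settled):

    $PGDATA/
        mydb/           <- tablespace implicitly created by CREATE DATABASE;
                           holds mydb's system tables (and, by default,
                           its user tables)
        bigspace -> /disk2/pgsql/bigspace
                        <- explicitly created tablespace, symlinked to
                           another disk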

We could avoid it along the lines you suggest (name table files like
DBOID.RELOID.VERSION instead of just RELOID.VERSION) but is it really
worth it?

I only intended that for pg_class and other bootstrap-sort-of tables,
maybe all system tables. Normal heap files could look like RELOID.VERSION,
whereas system tables would look like "name.DBOID".

That would imply that the very bottom levels of the system know all
about which tables are system tables and which are not (and, if you
are really going to insist on the "name" part of that, that they
know what name goes with each system-table OID). I'd prefer to avoid
that. The less the smgr knows about the upper levels of the system,
the better.

Clearly there's no market for renaming system tables or dropping any
of their columns.

No, but there is a market for compacting indexes on system relations,
and I haven't heard a good proposal for doing index compaction in place.
So we need versioning for system indexes.

Vadim's concerned about every byte that has to go into the WAL log,
and I think he's got a good point.

True. But if you only do it for the system tables then it might take less
space than keeping track of lots of table spaces that are unneeded. :-)

Again, WAL should not need to distinguish system and user tables.

And as for the keeping track, the tablespace OID will simply replace the
database OID in the log and in the smgr interfaces. There's no "extra"
cost, except maybe by comparison to a system with neither tablespaces
nor multiple databases.

regards, tom lane

#338Chris Bitmead
chris@bitmead.com
In reply to: Tom Lane (#316)
Re: Big 7.1 open items

Tom Lane wrote:

Now that you mention it, this is another reason why system tables for
each database have to live in a separate tablespace directory: there's
no other good way to do that final stage of DROP DATABASE. The
DROP-each-table approach doesn't work for system tables (somewhere along
about the point where you drop pg_attribute, DROP TABLE itself would
stop working ;-)).

If drop table is extended to drop multiple tables at once, then you can
read and cache everything you need in a first stage before doing all
the destruction in the second stage.

#339Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#337)
Re: Big 7.1 open items

Tom Lane writes:

In practice, for someone who doesn't need to worry about tablespaces
(because they put the installation on a disk with enough room for
their purposes), the whole thing acts exactly the same as it does now.

But I'd venture the guess that for someone who wants to use tablespaces it
wouldn't work as expected. Table spaces should represent a physical
storage location. Creation of table spaces should be a restricted
operation, possibly more than, but at least differently from, databases.
Eventually, table spaces probably will have attributes, such as
optimization parameters (random_page_cost). This will not work as expected
if you intermix them with the databases.

I'd expect that if I have three disks and 50 databases, then I make three
tablespaces and assign the databases to them. I'll bet lunch that if we
don't do it that way that before long people will come along and ask for
something that does work this way.
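
To make that expectation concrete, with entirely hypothetical syntax
(none of this exists; it is just the shape of interface I would expect):

    CREATE TABLESPACE disk2 LOCATION '/disk2/pgsql';
    CREATE DATABASE sales IN TABLESPACE disk2;
    CREATE DATABASE hr IN TABLESPACE disk2;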

--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden

#340Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#339)
Re: Big 7.1 open items

Peter Eisentraut <peter_e@gmx.net> writes:

I'd expect that if I have three disks and 50 databases, then I make three
tablespaces and assign the databases to them.

In our last installment, you were complaining that you didn't want to
be bothered with that ;-)

But I don't see any reason why CREATE DATABASE couldn't take optional
parameters indicating where to create the new DB's default tablespace.
We already have a LOCATION option for it that does something close to
that.

Come to think of it, it would probably make sense to adapt the existing
notion of "location" (cf initlocation script) into something meaning
"directory that users are allowed to create tablespaces (including
databases) in". If there were an explicit table of allowed locations,
it could be used to address the protection issues you raise --- for
example, a location could be restricted so that only some users could
create tablespaces/databases in it. $PGDATA/data would be just the
first location in every installation.
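
For comparison, the existing mechanism works roughly like this, with
PGDATA2 being an environment variable visible to the postmaster and the
directory prepared beforehand:

    $ initlocation PGDATA2

then, from a client:

    CREATE DATABASE mydb WITH LOCATION = 'PGDATA2';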

regards, tom lane

#341Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#340)
Re: Big 7.1 open items

Tom Lane writes:

Come to think of it, it would probably make sense to adapt the existing
notion of "location" (cf initlocation script) into something meaning
"directory that users are allowed to create tablespaces (including
databases) in".

This is what I've been trying to push all along. But note that this
mechanism does allow multiple databases per location. :)

--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden

#342Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Peter Eisentraut (#341)
AW: Big 7.1 open items

In my mind the point of the "database" concept is to provide a domain
within which custom datatypes and functions are available.

AFAIK few users understand it and many users have wondered
why we couldn't issue cross "database" queries.

Imho the same issue is access to tables on another machine.
If we "fix" that, access to another db on the same instance is just
a variant of the above.

What is the difference between SCHEMA and your "database"?
I myself am confused about them.

"my *database*" corresponds to the current database, which is created with
"create database" in postgresql. It corresponds to the catalog concept in
SQL99.

The schema is below the database. Access to different schemas with one
connection
is mandatory. Access to different catalogs (databases) with one connection
is not mandatory,
but should imho be solved analogous to access to another catalog on a
different
(SQL99) cluster. This would be a very nifty feature.

Andreas