Primary Key

Started by P. Dwayne Miller over 24 years ago · 43 messages
#1P. Dwayne Miller
dmiller@espgroup.net

Given the following CREATE TABLE instructions...

1)
CREATE TABLE message
(
int4 msgid PRIMARY KEY NOT NULL,
text msgtext
);

2)
CREATE TABLE message
(
int4 msgid not null,
text msgtext,
PRIMARY KEY (msgid)
);

3)
CREATE TABLE message
(
int4 msgid not null,
text msgtext,
CONSTRAINT cons_001_pk PRIMARY KEY on (msgid)
);

The first two actually create a PRIMARY KEY on msgid. The third seems
to have a PRIMARY KEY on 'oid', not 'msgid', though it does create a
unique index on 'msgid'. One of the applications I'm using (Cold
Fusion) looks for the PRIMARY KEY and checks that I have included that
column(s) in my data statement.

The first two work, the third does not. Cold Fusion reports that I did
not provide 'oid' as one of the data elements.

Cold Fusion is accessing the database using ODBC.
Database is Postgres v7.1.1 on Red Hat Linux 7.0

I'm not looking for a fix as I can create the table using the syntax
that gives the expected results, but just wanted to alert someone that
there is some inconsistency in the way a PRIMARY KEY is used or
designated.

BTW, I did not try the COLUMN CONSTRAINT syntax.
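
For reference, the named column-constraint form I did not try would presumably
look something like this (untested):

CREATE TABLE message
(
msgid int4 CONSTRAINT cons_001_pk PRIMARY KEY,  -- PRIMARY KEY implies NOT NULL
msgtext text
);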

Thanks

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: P. Dwayne Miller (#1)
Re: Primary Key

"P. Dwayne Miller" <dmiller@espgroup.net> writes:

CREATE TABLE message
(
int4 msgid not null,
text msgtext,
CONSTRAINT cons_001_pk PRIMARY KEY on (msgid)
);

The first two actually create a PRIMARY KEY on msgid. The third seems
to have a PRIMARY KEY on 'oid', not 'msgid', though it does create a
unique index on 'msgid'.

After fixing the several obvious syntax errors, it works fine for me:

regression=# CREATE TABLE message
regression-# (
regression(# msgid int4 not null,
regression(# msgtext text,
regression(# CONSTRAINT cons_001_pk PRIMARY KEY (msgid)
regression(# );
NOTICE: CREATE TABLE/PRIMARY KEY will create implicit index 'cons_001_pk' for table 'message'
CREATE
regression=# \d message
Table "message"
Attribute | Type | Modifier
-----------+---------+----------
msgid | integer | not null
msgtext | text |
Primary Key: cons_001_pk

regression=#

Is Cold Fusion perhaps doing strange things to the query behind your
back? None of those CREATE TABLE commands are legal SQL according
to my references.

regards, tom lane

#3P. Dwayne Miller
dmiller@espgroup.net
In reply to: Tom Lane (#2)
Re: Primary Key

My bad on the syntax of all three. I used your syntax (which is what I had originally used) and
got the same results with the \d command that you show.

I'm only using Cold Fusion to read data from the resulting table, not create the table... and I
still get an error when I have created the primary key using the table constraint syntax. Cold
Fusion is reporting that the primary key has been defined for the column oid. Using the correct
syntax with the first two CREATE TABLE statements, Cold Fusion reports the primary key field as
msgid.

Thanks for your reply,
Dwayne

Tom Lane wrote:

"P. Dwayne Miller" <dmiller@espgroup.net> writes:

CREATE TABLE message
(
int4 msgid not null,
text msgtext,
CONSTRAINT cons_001_pk PRIMARY KEY on (msgid)
);

The first two actually create a PRIMARY KEY on msgid. The third seems
to have a PRIMARY KEY on 'oid', not 'msgid', though it does create a
unique index on 'msgid'.

After fixing the several obvious syntax errors, it works fine for me:

regression=# CREATE TABLE message
regression-# (
regression(# msgid int4 not null,
regression(# msgtext text,
regression(# CONSTRAINT cons_001_pk PRIMARY KEY (msgid)
regression(# );
NOTICE: CREATE TABLE/PRIMARY KEY will create implicit index 'cons_001_pk' for table 'message'
CREATE
regression=# \d message
Table "message"
Attribute | Type | Modifier
-----------+---------+----------
msgid | integer | not null
msgtext | text |
Primary Key: cons_001_pk

regression=#

Is Cold Fusion perhaps doing strange things to the query behind your
back? None of those CREATE TABLE commands are legal SQL according
to my references.

regards, tom lane

#4Thomas Swan
tswan-lst@ics.olemiss.edu
In reply to: P. Dwayne Miller (#1)
Re: Primary Key

Tom Lane wrote:

After fixing the several obvious syntax errors, it works fine for me:

regression=# CREATE TABLE message
regression-# (
regression(# msgid int4 not null,
regression(# msgtext text,
regression(# CONSTRAINT cons_001_pk PRIMARY KEY (msgid)
regression(# );
NOTICE: CREATE TABLE/PRIMARY KEY will create implicit index 'cons_001_pk' for table 'message'
CREATE
regression=# \d message
Table "message"
Attribute | Type | Modifier
-----------+---------+----------
msgid | integer | not null
msgtext | text |
Primary Key: cons_001_pk

regression=#

Is Cold Fusion perhaps doing strange things to the query behind your
back? None of those CREATE TABLE commands are legal SQL according
to my references.

I've been using the syntax "PRIMARY KEY (column_name [, column_name])"
without the constraint name, and the "COLUMN_NAME TYPE PRIMARY KEY"
syntax for some time now. I may be admitting to SQL heresy in
saying that; but that's the syntax I've seen in MySQL and in quite a
few SQL/database books.

AFAIK, it's a legal table creation statement.
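
For example, both of these forms seem to be accepted (the table names here
are just placeholders):

CREATE TABLE message_a
(
msgid int4 PRIMARY KEY,        -- column-constraint form; PRIMARY KEY implies NOT NULL
msgtext text
);

CREATE TABLE message_b
(
msgid int4 not null,
msgtext text,
PRIMARY KEY (msgid)            -- table-constraint form, no constraint name
);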

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Swan (#4)
Re: Primary Key

Thomas Swan <tswan-lst@ics.olemiss.edu> writes:

AFAIK, it's a legal table creation statement.

The variant I showed is. The original one had an extraneous "ON" in the
PRIMARY KEY clause, and even more to the point, all the column
declarations had column name and type name reversed. That's why I was
questioning the syntax ...

regards, tom lane

#6Naomi Walker
nwalker@eldocomp.com
In reply to: P. Dwayne Miller (#3)
Backup and Recovery

I'm new to this list, and to PostgreSQL, and could use some advice from you
experienced users. We are very interested in making PostgreSQL work for
our project, but it's missing one big feature that is absolutely necessary
for a true OLTP shop.

Even more important than uptime to us is to never put ourselves in a
position where we could lose data. I understand I can do a hot backup with
pg_dumpall. What we need on top of that is the ability to replay the
transaction logs against the previous database archive. Without such a
feature, even if I did a full backup a few times a day, we would be
vulnerable to losing hours of data (which would not be acceptable to our
users).

I can tell this has been designed to do exactly that, because it's really
close. What would be needed is a hook to write the logs to disk/tape when
they are full (and not overwrite them until they go elsewhere), and the
ability to actually play back the logs, exactly at the right place, tied to
a specific archive.

I'm sure this is something that would benefit all our lives. Other than
just hiring a consultant to do so, is there some way to make this
happen? Other than eliminating all my single points of failure in the
hardware, is there some other way to solve this problem?

Thanks,
Naomi
--
Naomi Walker
Chief Information Officer
Eldorado Computing, Inc.
602-604-3100 ext 242

#7Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#2)
Re: Re: Primary Key

"P. Dwayne Miller" wrote:

My bad on the syntax of all three. I used your syntax (which is what I had originally used) and
got the same results with the \d command that you show.

I'm only using Cold Fusion to read data from the resulting table, not create the table... and I
still get an error when I have created the primary key using the table constraint syntax. Cold
Fusion is reporting that the primary key has been defined for the column oid. Using the correct
syntax with the first two CREATE TABLE statements, Cold Fusion reports the primary key field as
msgid.

SQLPrimaryKey() in the current psqlodbc driver doesn't
report a primary key whose index is not named tablename_pkey.
That seems to be the cause.
I would change the implementation of SQLPrimaryKey().
Dwayne, could you try the modified driver?
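
To see which index actually backs a table's primary key, a catalog query
along these lines should work ('message' is just the table from the
original post):

SELECT c.relname AS pkey_index
FROM pg_class c, pg_index i
WHERE c.oid = i.indexrelid
AND i.indisprimary
AND i.indrelid = (SELECT oid FROM pg_class WHERE relname = 'message');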

regards,
Hiroshi Inoue

#8P. Dwayne Miller
dmiller@espgroup.net
In reply to: Hiroshi Inoue (#7)
Re: Primary Key

I can try it. Where do I get it?

My question would be: if SQLPrimaryKey() only reports tablename_pkey, why does my front end
return oid as the primary key?

Thanks,
Dwayne

Hiroshi Inoue wrote:

"P. Dwayne Miller" wrote:

My bad on the syntax of all three. I used your syntax (which is what I had originally used) and
got the same results with the \d command that you show.

I'm only using Cold Fusion to read data from the resulting table, not create the table... and I
still get an error when I have created the primary key using the table constraint syntax. Cold
Fusion is reporting that the primary key has been defined for the column oid. Using the correct
syntax with the first two CREATE TABLE statements, Cold Fusion reports the primary key field as
msgid.

SQLPrimaryKey() in the current psqlodbc driver doesn't
report a primary key whose index is not named tablename_pkey.
That seems to be the cause.
I would change the implementation of SQLPrimaryKey().
Dwayne, could you try the modified driver?

regards,
Hiroshi Inoue

#9Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Hiroshi Inoue (#7)
Re: Re: Primary Key

"P. Dwayne Miller" wrote:

I can try it. Where do I get it?

My question would be: if SQLPrimaryKey() only reports tablename_pkey, why does my front end
return oid as the primary key?

Thanks,
Dwayne

Hiroshi Inoue wrote:

"P. Dwayne Miller" wrote:

My bad on the syntax of all three. I used your syntax (which is what I had originally used) and
got the same results with the \d command that you show.

I'm only using Cold Fusion to read data from the resulting table, not create the table... and I
still get an error when I have created the primary key using the table constraint syntax. Cold
Fusion is reporting that the primary key has been defined for the column oid. Using the correct
syntax with the first two CREATE TABLE statements, Cold Fusion reports the primary key field as
msgid.

SQLPrimaryKey() in the current psqlodbc driver doesn't
report a primary key whose index is not named tablename_pkey.
That seems to be the cause.
I would change the implementation of SQLPrimaryKey().
Dwayne, could you try the modified driver?

regards,
Hiroshi Inoue

#10Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Hiroshi Inoue (#7)
Re: Re: Primary Key

"P. Dwayne Miller" wrote:

I can try it. Where do I get it?

I can send you the DLL, though I haven't tested it myself.
OK?

My question would be: if SQLPrimaryKey() only reports tablename_pkey, why does my front end
return oid as the primary key?

Do you have the FAKEOIDINDEX option turned on?

regards,
Hiroshi Inoue

#11Matthew Kirkwood
matthew@hairy.beasts.org
In reply to: Naomi Walker (#6)
Re: Backup and Recovery

On Tue, 19 Jun 2001, Naomi Walker wrote:

Even more important than uptime to us is to never put ourselves in a
position where we could lose data. I understand I can do a hot backup
with pg_dumpall. What we need on top of that is the ability to replay
the transaction logs against the previous database archive. Without
such a feature, even if I did a full backup a few times a day, we
would be vulnerable to losing hours of data (which would not be
acceptable to our users).

This is what I'd like too (though I'm not that bothered about
rolling forward from a dump if I can just do it by replaying
logs onto real datafiles).

I mentioned it a while ago:

http://fts.postgresql.org/db/mw/msg.html?mid=114397

but got no response.

You are aware that you can still lose up to (by default) 16Mb
worth of transactions in this scheme, I presume?

Matthew.

#12P. Dwayne Miller
dmiller@espgroup.net
In reply to: Hiroshi Inoue (#10)
Re: Primary Key

Please send it. And yes, I do have the Fake OID Index turned on, although I have no idea what it does.

Thanks,
Dwayne

Hiroshi Inoue wrote:

"P. Dwayne Miller" wrote:

I can try it. Where do I get it?

I can send you the DLL, though I haven't tested it myself.
OK?

My question would be: if SQLPrimaryKey() only reports tablename_pkey, why does my front end
return oid as the primary key?

Do you have the FAKEOIDINDEX option turned on?

regards,
Hiroshi Inoue

#13Ross J. Reedstrom
reedstrm@rice.edu
In reply to: P. Dwayne Miller (#3)
Re: Re: Primary Key

Hmm, you're using ColdFusion, so that goes through the ODBC driver, which
picks up the 'primary key' by looking for an index named 'foo_pkey',
I think. Ah, here it is:

in interfaces/odbc/info.c:

sprintf(tables_query, "select ta.attname, ia.attnum"
" from pg_attribute ta, pg_attribute ia, pg_class c, pg_index i"
" where c.relname = '%s_pkey'"
" AND c.oid = i.indexrelid"
" AND ia.attrelid = i.indexrelid"
" AND ta.attrelid = i.indrelid"
" AND ta.attnum = i.indkey[ia.attnum-1]"
" order by ia.attnum", pktab);

So, don't name the primary key constraint, or name it '<tablename>_pkey',
and you should be fine. Something's falling back to trying to use
oid if it can't find a primary key; I'm not sure if that's inside the
ODBC driver, or in ColdFusion.
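
For example (reusing the table from the start of the thread), naming the
constraint after the table should give the index a name that the query
above will find:

CREATE TABLE message
(
msgid int4 not null,
msgtext text,
CONSTRAINT message_pkey PRIMARY KEY (msgid)  -- implicit index is named message_pkey
);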

Hmm, seems we have other Access specific hacks in the ODBC driver:

/*
* I have to hide the table owner from Access, otherwise it
* insists on referring to the table as 'owner.table'. (this
* is valid according to the ODBC SQL grammar, but Postgres
* won't support it.)
*
* set_tuplefield_string(&row->tuple[1], table_owner);
*/

I bet PgAdmin would like to have that info.

Ross

On Tue, Jun 19, 2001 at 06:11:12PM -0400, P. Dwayne Miller wrote:

My bad on the syntax of all three. I used your syntax (which is what I had originally used) and
got the same results with the \d command that you show.

I'm only using Cold Fusion to read data from the resulting table, not create the table... and I
still get an error when I have created the primary key using the table constraint syntax. Cold
Fusion is reporting that the primary key has been defined for the column oid. Using the correct
syntax with the first two CREATE TABLE statements, Cold Fusion reports the primary key field as
msgid.

Thanks for your reply,
Dwayne

#14Naomi Walker
nwalker@eldocomp.com
In reply to: Matthew Kirkwood (#11)
Re: Backup and Recovery

At 12:26 PM 6/20/01 +0100, Matthew Kirkwood wrote:

On Tue, 19 Jun 2001, Naomi Walker wrote:

Even more important than uptime to us is to never put ourselves in a
position where we could lose data. I understand I can do a hot backup
with pg_dumpall. What we need on top of that is the ability to replay
the transaction logs against the previous database archive. Without
such a feature, even if I did a full backup a few times a day, we
would be vulnerable to losing hours of data (which would not be
acceptable to our users).

This is what I'd like too (though I'm not that bothered about
rolling forward from a dump if I can just do it by replaying
logs onto real datafiles).

I mentioned it a while ago:

http://fts.postgresql.org/db/mw/msg.html?mid=114397

but got no response.

Well, so now there are at least TWO of us....

We should start the thread again.

You are aware that you can still lose up to (by default) 16Mb
worth of transactions in this scheme, I presume?

I'm just starting with PostgreSQL, but I thought with fsync on this was
not the case. Is that not true, or what else did I miss?

Matthew.

--
Naomi Walker
Chief Information Officer
Eldorado Computing, Inc.
602-604-3100 ext 242

#15Philip Warner
pjw@rhyme.com.au
In reply to: Naomi Walker (#14)
Re: Backup and Recovery

At 13:41 20/06/01 -0700, Naomi Walker wrote:

Well, so now there is at least TWO of us....

We should start the thread again.

WAL based backup & recovery is something I have been trying to do in
background, but unfortunately I have no time at the moment. I do plan to
get back to it as soon as I can.

----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.B.N. 75 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/

#16Matthew Kirkwood
matthew@hairy.beasts.org
In reply to: Naomi Walker (#14)
Re: Backup and Recovery

On Wed, 20 Jun 2001, Naomi Walker wrote:

You are aware that you can still lose up to (by default) 16Mb
worth of transactions in this scheme, I presume?

I'm just starting with PostgreSQL, but I thought with fsync on this
was not the case. Is that not true, or what else did I miss?

I suppose that it rather depends on how you expected to
move the logs over. My approach was to archive the redo
when PG is done with them and only then to roll them
forward.

If a catastrophe occurs, then I wouldn't be able to do
anything with a half-full log.

Our Oracle setups use redo logs of only 1Mb for this
reason, and it doesn't seem to hurt too much (though
Oracle's datafile formats seem a fair bit denser than
Postgres's).

Matthew.

#17Nathan Myers
ncm@zembu.com
In reply to: Matthew Kirkwood (#16)
Re: Backup and Recovery

On Thu, Jun 21, 2001 at 11:01:29AM +0100, Matthew Kirkwood wrote:

On Wed, 20 Jun 2001, Naomi Walker wrote:

You are aware that you can still lose up to (by default) 16Mb
worth of transactions in this scheme, I presume?

I'm just starting with PostgreSQL, but I thought with fsync on this
was not the case. Is that not true, or what else did I miss?

I suppose that it rather depends on how you expected to
move the logs over. My approach was to archive the redo
when PG is done with them and only then to roll them
forward.

If a catastrophe occurs, then I wouldn't be able to do
anything with a half-full log.

Our Oracle setups use redo logs of only 1Mb for this
reason, and it doesn't seem to hurt too much (though
Oracle's datafile formats seem a fair bit denser than
Postgres's).

The above makes no sense to me. A hot recovery that discards some
random number of committed transactions is a poor sort of recovery.

Ms. Walker might be able to adapt one of the several replication
tools available for PG to do replayable logging, instead.

It seems to me that for any replication regime (symmetric or not,
synchronous or not, global or not), and also any hot-backup/recovery
approach, an update-log mechanism that produces a high-level
description of changes is essential. Using triggers to produce
such a log seems to me to be too slow and too dependent on finicky
administrative procedures.

IIUC, the regular WAL records are optimized for a different purpose:
speeding up normal operation. Also IIUC, the WAL cannot be applied
to a database reconstructed from a dump. If augmented to enable such
reconstruction, the WAL might be too bulky to serve well in that role;
it currently only needs to keep enough data to construct the current
database from a recent checkpoint, so compactness has not been
crucial. But there's much to be said for having just a single
synchronous log mechanism. A high-level log mixed into the WAL, to
be extracted asynchronously to a much more compact replay log, might
be the ideal compromise.

The same built-in high-level logging mechanism could make all the
various kinds of disaster prevention, disaster recovery, and load
sharing much easier to implement, because they all need much the
same thing.

Nathan Myers
ncm@zembu.com

#18Matthew Kirkwood
matthew@hairy.beasts.org
In reply to: Nathan Myers (#17)
Re: Backup and Recovery

On Thu, 21 Jun 2001, Nathan Myers wrote:

I suppose that it rather depends on how you expected to
move the logs over. My approach was to archive the redo
when PG is done with them and only then to roll them
forward.

The above makes no sense to me. A hot recovery that discards some
random number of committed transactions is a poor sort of recovery.

Agreed. Nevertheless it's at least db_size - 1Mb better
than the current options.

Recovering the rest manually from log files is good enough
for us (indeed, much better than the potential loss of
performance or reliability from "real" replication).

If it horrifies you that much, think of it as 15-minutely
incremental backups.

Matthew.

#19John Moore
nj7e@yahoo.com
In reply to: Matthew Kirkwood (#11)
Re: Backup and Recovery

matthew@hairy.beasts.org (Matthew Kirkwood) wrote in message news:<Pine.LNX.4.33.0106201212240.25630-100000@sphinx.mythic-beasts.com>...

On Tue, 19 Jun 2001, Naomi Walker wrote:

Even more important than uptime to us is to never put ourselves in a
position where we could lose data. I understand I can do a hot backup
with pg_dumpall. What we need on top of that is the ability to replay
the transaction logs against the previous database archive. Without
such a feature, even if I did a full backup a few times a day, we
would be vulnerable to losing hours of data (which would not be
acceptable to our users).

This is what I'd like too (though I'm not that bothered about
rolling forward from a dump if I can just do it by replaying
logs onto real datafiles).

With stock PostgreSQL... how many committed transactions can one lose
on a simple system crash/reboot? With Oracle or Informix, the answer
is zero. Is that true with PostgreSQL in fsync mode? If not, does it
lose all in the log, or just those not yet written to the DB?

Thanks

John

#20Rod Taylor
rbt@barchord.com
In reply to: Matthew Kirkwood (#11)
Re: Re: Backup and Recovery

With stock PostgreSQL... how many committed transactions can one lose
on a simple system crash/reboot? With Oracle or Informix, the answer
is zero. Is that true with PostgreSQL in fsync mode? If not, does it
lose all in the log, or just those not yet written to the DB?

With WAL the theory is that it will not lose a committed transaction.
Bugs have plagued previous versions (7.1.2 looks clean), and none of
them (Oracle, Informix, Postgres) can protect against coding errors in
certain cases, but for a general power failure it's fine.

This assumes adequate hardware too. Some hard drives claim to have
written when they haven't, among other things, but Postgres itself
won't lose the information -- your hardware might do that silently,
though.

#21Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Rod Taylor (#20)
RE: Re: Backup and Recovery

With stock PostgreSQL... how many committed transactions can one lose
on a simple system crash/reboot? With Oracle or Informix, the answer
is zero. Is that true with PostgreSQL in fsync mode? If not, does it

It's true, or better to say it should be, keeping in mind the probability of bugs.

lose all in the log, or just those not yet written to the DB?

BAR is not for a "simple crash" but for disk crashes. In that case
one will lose as much as is in the WAL files that were lost.

Vadim

#22Nathan Myers
ncm@zembu.com
In reply to: John Moore (#19)
Re: Re: Backup and Recovery

On Thu, Jun 28, 2001 at 08:33:45AM -0700, John Moore wrote:

matthew@hairy.beasts.org (Matthew Kirkwood) wrote in message news:<Pine.LNX.4.33.0106201212240.25630-100000@sphinx.mythic-beasts.com>...

On Tue, 19 Jun 2001, Naomi Walker wrote:

Even more important than uptime to us is to never put ourselves in a
position where we could lose data. I understand I can do a hot backup
with pg_dumpall. What we need on top of that is the ability to replay
the transaction logs against the previous database archive. Without
such a feature, even if I did a full backup a few times a day, we
would be vulnerable to losing hours of data (which would not be
acceptable to our users).

This is what I'd like too (though I'm not that bothered about
rolling forward from a dump if I can just do it by replaying
logs onto real datafiles).

With stock PostgreSQL... how many committed transactions can one lose
on a simple system crash/reboot? With Oracle or Informix, the answer
is zero. Is that true with PostgreSQL in fsync mode? If not, does it
lose all in the log, or just those not yet written to the DB?

The answer is zero for PG as well. However, what happens if the
database becomes corrupted (e.g. because of bad RAM or bad disk)?

With Informix and Oracle, you can restore from a snapshot backup
and replay the "redo" logs since that backup, if you kept them.

Alternatively, you can keep a "failover" server that is up to date
with the last committed transaction. If it matters, you do both.
(If you're lucky, the disk or memory failure won't have corrupted
all your backups and failover servers before you notice.)

There is currently no builtin support for either in PG. Of course
both can be simulated in the client. Also, for any particular
collection of tables, a redo or replication log may be produced with
triggers; that's how the currently available replication add-ons
for PG work. Something built in could be much faster and much less
fragile.

I imagine a daemon extracting redo log entries from WAL segments,
asynchronously. Mixing redo log entries into the WAL allows the WAL
to be the only synchronous disk writer in the system, a Good Thing.

Nathan Myers
ncm@zembu.com

#23Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Nathan Myers (#22)
AW: Re: Backup and Recovery

I imagine a daemon extracting redo log entries from WAL segments,
asynchronously. Mixing redo log entries into the WAL allows the WAL
to be the only synchronous disk writer in the system, a Good Thing.

This comes up periodically now. WAL currently already has all the info
that would be needed for redo (it actually has to). All that is missing
is a program that can take a consistent physical snapshot (as it was after
a particular checkpoint) and replay the WAL after a restore of such a
snapshot. This replay after a consistent snapshot is probably as simple
as making the WAL files available to the standard startup rollforward (redo)
mechanism that is already implemented.

Actually it might even be possible to do the WAL redo based on
an inconsistent backup (e.g. done with tar/cpio), but that probably needs
more thought and testing than above. At the least, the restore would need to
generate a valid pg_control before starting redo.

Andreas

#24Nathan Myers
ncm@zembu.com
In reply to: Zeugswetter Andreas SB (#23)
Re: Re: Backup and Recovery

On Wed, Jul 04, 2001 at 10:37:57AM +0200, Zeugswetter Andreas SB wrote:

I imagine a daemon extracting redo log entries from WAL segments,
asynchronously. Mixing redo log entries into the WAL allows the WAL
to be the only synchronous disk writer in the system, a Good Thing.

This comes up periodically now. WAL currently already has all the info
that would be needed for redo (it actually has to).

The WAL has the information needed to take a binary table image
from the checkpoint state to the last committed transaction.
IIUC, it is meaningless in relation to a pg_dump image.

All that is missing is a program, that can take a consistent physical
snapshot (as it was after a particular checkpoint) and would replay
the WAL after a restore of such a snapshot. This replay after a
consistent snapshot is probably as simple as making the WAL files
available to the standard startup rollforward (redo) mechanism, that
is already implemented.

How would you take a physical snapshot without interrupting database
operation? Is a physical/binary snapshot a desirable backup format?
People seem to want to be able to restore from ASCII dumps.

Also, isn't the WAL format rather bulky to archive hours and hours of?
I would expect high-level transaction redo records to be much more compact;
mixed into the WAL, such records shouldn't make the WAL grow much faster.

Nathan Myers
ncm@zembu.com

#25Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Nathan Myers (#24)
Re: Re: Backup and Recovery

All that is missing is a program, that can take a consistent physical
snapshot (as it was after a particular checkpoint) and would replay
the WAL after a restore of such a snapshot. This replay after a
consistent snapshot is probably as simple as making the WAL files
available to the standard startup rollforward (redo) mechanism, that
is already implemented.

How would you take a physical snapshot without interrupting database
operation? Is a physical/binary snapshot a desirable backup format?
People seem to want to be able to restore from ASCII dumps.

Also, isn't the WAL format rather bulky to archive hours and hours of?
I would expect high-level transaction redo records to be much more compact;
mixed into the WAL, such records shouldn't make the WAL grow much faster.

The page images are not needed and can be thrown away once the page is
completely sync'ed to disk or a checkpoint happens.

The row images aren't that large. I think any solution would have to
handle page images and row images differently.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#26Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Nathan Myers (#24)
Re: Re: Backup and Recovery

On Wed, Jul 04, 2001 at 10:37:57AM +0200, Zeugswetter Andreas SB wrote:

I imagine a daemon extracting redo log entries from WAL segments,
asynchronously. Mixing redo log entries into the WAL allows the WAL
to be the only synchronous disk writer in the system, a Good Thing.

This comes up periodically now. WAL currently already has all the info
that would be needed for redo (it actually has to).

The WAL has the information needed to take a binary table image
from the checkpoint state to the last committed transaction.
IIUC, it is meaningless in relation to a pg_dump image.

Also, I expect this will be completed for 7.2.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#27Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Bruce Momjian (#26)
AW: Re: Backup and Recovery

I imagine a daemon extracting redo log entries from WAL segments,
asynchronously. Mixing redo log entries into the WAL allows the WAL
to be the only synchronous disk writer in the system, a Good Thing.

This comes up periodically now. WAL currently already has all the info
that would be needed for redo (it actually has to).

The WAL has the information needed to take a binary table image
from the checkpoint state to the last committed transaction.

It actually even contains the information needed to take an
inconsistent binary table image from any point in time to as far as you
have WAL records for. The only prerequisite is that you apply
at least all those WAL records that were created during the time window
from the start of reading the binary table image until the end of reading
it for backup.

If you want to restore a whole instance from an inconsistent physical dump,
you need to apply at least all WAL records from the time window from the start
of the whole physical dump to its end. This would have the
advantage of not needing any synchronization with the backend for doing
the physical dump. (It doesn't even matter if you read inconsistent pages,
since those will be restored by "physical log" in WAL later.) If you start
the physical dump with pg_control, you might not need to reconstruct one for
rollforward later.

IIUC, it is meaningless in relation to a pg_dump image.

Yes, pg_dump produces a "logical snapshot". You cannot use a data content
dump for later rollforward.

Andreas

#28Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Zeugswetter Andreas SB (#27)
AW: Re: Backup and Recovery

Also, isn't the WAL format rather bulky to archive hours and hours of?

If it were actually too bulky, then it needs to be made less so, since that
directly affects overall performance :-)

I would expect high-level transaction redo records to be much more compact;
mixed into the WAL, such records shouldn't make the WAL grow much faster.

All redo records have to be at the tuple level, so what higher-level are you talking
about ? (statement level redo records would not be able to reproduce the same
resulting table data (keyword: transaction isolation level))

The page images are not needed and can be thrown away once the page is
completely sync'ed to disk or a checkpoint happens.

Actually they should at least be kept another few seconds to allow "stupid"
disks to actually write the pages :-) But see previous mail, they can also
help with various BAR restore solutions.

Andreas

#29Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Zeugswetter Andreas SB (#28)
Re: AW: Re: Backup and Recovery

The page images are not needed and can be thrown away once the page is
completely sync'ed to disk or a checkpoint happens.

Actually they should at least be kept another few seconds to allow "stupid"
disks to actually write the pages :-) But see previous mail, they can also
help with various BAR restore solutions.

Agreed. They have to be kept a few seconds.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#30Naomi Walker
nwalker@eldocomp.com
In reply to: Bruce Momjian (#25)
Re: Re: Backup and Recovery

At 11:04 AM 7/4/01 -0400, Bruce Momjian wrote:

All that is missing is a program, that can take a consistent physical
snapshot (as it was after a particular checkpoint) and would replay
the WAL after a restore of such a snapshot. This replay after a
consistent snapshot is probably as simple as making the WAL files
available to the standard startup rollforward (redo) mechanism, that
is already implemented.

You would also have to have some triggering method that would not allow
WALs to be overwritten, and make sure they are sent to some device before
that happens. In addition, you'd have to anchor the logs against the dumped
database, so it would know exactly which log to start with when replaying
against any given archive.
--
Naomi Walker
Chief Information Officer
Eldorado Computing, Inc.
602-604-3100 ext 242

#31Nathan Myers
ncm@zembu.com
In reply to: Zeugswetter Andreas SB (#28)
Re: Re: Backup and Recovery

On Thu, Jul 05, 2001 at 02:27:01PM +0200, Zeugswetter Andreas SB wrote:

Also, isn't the WAL format rather bulky to archive hours and hours of?

If it were actually too bulky, then it needs to be made less so, since
that directly affects overall performance :-)

ISTM that WAL record size trades off against lots of things, including
(at least) complexity of recovery code, complexity of WAL generation
code, usefulness in fixing corrupt table images, and processing time
it would take to produce smaller log entries.

Complexity is always expensive, and CPU time spent "pre-sync" is a lot
more expensive than time spent in background. That is, time spent
generating the raw log entries affects latency and peak capacity,
where time in background mainly affects average system load.

For a WAL, the balance seems to be far to the side of simple-and-bulky.
For other uses, the balance is sure to be different.

I would expect high-level transaction redo records to be much more
compact; mixed into the WAL, such records shouldn't make the WAL
grow much faster.

All redo records have to be at the tuple level, so what higher-level
are you talking about ? (statement level redo records would not be
able to reproduce the same resulting table data (keyword: transaction
isolation level))

Statement-level redo records would be nice, but as you note they are
rarely practical if done by the database.

Redo records that contain whole blocks may be much bulkier
than records of whole tuples. Redo records of whole tuples may be much
bulkier than those that just identify changed fields.

Bulky logs mean more-frequent snapshot backups, and bulky log formats
are less suitable for network transmission, and therefore less useful
for replication. Smaller redo records take more processing to generate,
but that processing can be done off-line, and the result saves other
costs.

Nathan Myers
ncm@zembu.com

#32Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Nathan Myers (#31)
Re: Re: Backup and Recovery

I would expect high-level transaction redo records to be much more
compact; mixed into the WAL, such records shouldn't make the WAL
grow much faster.

All redo records have to be at the tuple level, so what higher-level
are you talking about ? (statement level redo records would not be
able to reproduce the same resulting table data (keyword: transaction
isolation level))

Statement-level redo records would be nice, but as you note they are
rarely practical if done by the database.

Redo records that contain whole blocks may be much bulkier
than records of whole tuples. Redo records of whole tuples may be much
bulkier than those that just identify changed fields.

Bulky logs mean more-frequent snapshot backups, and bulky log formats
are less suitable for network transmission, and therefore less useful
for replication. Smaller redo records take more processing to generate,
but that processing can be done off-line, and the result saves other
costs.

Tom has identified that VACUUM generates huge WAL traffic because of the
writing of page preimages in case the page is partially written to disk.
It would be nice to split those out into a separate WAL file _except_ it
would require two fsync() calls for commit (bad), so we are stuck. Once the
page is flushed to disk after a checkpoint, we don't really need those
pre-images anymore, hence the splitting of WAL page images and row
records for recovery purposes.

In other words, we keep the page images and row records in one file so
we can do one fsync, but once we have written the page, we don't want to
store them for later point-in-time recovery.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#32)
Re: Re: Backup and Recovery

Bruce Momjian <pgman@candle.pha.pa.us> writes:

In other words, we keep the page images and row records in one file so
we can do one fsync, but once we have written the page, we don't want to
store them for later point-in-time recovery.

What we'd want to do is strip the page images from the version of the
logs that's archived for recovery purposes. Ideally the archiving
process would also discard records from aborted transactions, but I'm
not sure how hard that'd be to do.

regards, tom lane

#34Nathan Myers
ncm@zembu.com
In reply to: Tom Lane (#33)
Re: Re: Backup and Recovery

On Thu, Jul 05, 2001 at 09:33:17PM -0400, Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

In other words, we keep the page images and row records in one file so
we can do one fsync, but once we have written the page, we don't want to
store them for later point-in-time recovery.

What we'd want to do is strip the page images from the version of the
logs that's archived for recovery purposes.

Am I correct in believing that the remaining row images would have to
be applied to a clean table-image snapshot? Maybe you can produce a
clean table-image snapshot by making a dirty image copy, and then
replaying the WAL from the time you started copying up to the time
when you finish copying.

How hard would it be to turn these row records into updates against a
pg_dump image, assuming access to a good table-image file?

Ideally the archiving process would also discard records from aborted
transactions, but I'm not sure how hard that'd be to do.

A second pass over the WAL file -- or the log-picker daemon's
first-pass output -- could eliminate the dead row images. Handling
WAL file boundaries might be tricky if one WAL file has dead row-images
and the next has the abort-or-commit record. Maybe the daemon has to
look ahead into the next WAL file to know what to discard from the
current file.

Would it be useful to mark points in a WAL file where there are no
transactions with outstanding writes?

Nathan Myers
ncm@zembu.com

#35Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Nathan Myers (#34)
AW: Re: Backup and Recovery

In other words, we keep the page images and row records in one file so
we can do one fsync, but once we have written the page, we don't want to
store them for later point-in-time recovery.

What we'd want to do is strip the page images from the version of the
logs that's archived for recovery purposes. Ideally the archiving
process would also discard records from aborted transactions, but I'm
not sure how hard that'd be to do.

Unless we have UNDO we also need to roll forward the physical changes of
aborted transactions, or later redo records will "sit on a wrong physical image".

Andreas

#36Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Zeugswetter Andreas SB (#35)
AW: Re: Backup and Recovery

Also, isn't the WAL format rather bulky to archive hours and hours of?

If it were actually too bulky, then it needs to be made less so, since
that directly affects overall performance :-)

ISTM that WAL record size trades off against lots of things, including
(at least) complexity of recovery code, complexity of WAL generation
code, usefulness in fixing corrupt table images, and processing time
it would take to produce smaller log entries.

Complexity is always expensive, and CPU time spent "pre-sync" is a lot
more expensive than time spent in background. That is, time spent
generating the raw log entries affects latency and peak capacity,
where time in background mainly affects average system load.

For a WAL, the balance seems to be far to the side of simple-and-bulky.
For other uses, the balance is sure to be different.

I do not agree with the conclusions you make above.
The limiting factor on the WAL is almost always the IO bottleneck.
How long startup rollforward takes after a crash is mainly influenced
by the checkpoint interval and IO. Thus you can spend enough additional
CPU to reduce WAL size if that leads to a substantial reduction.
Keep in mind, though, that because of TOAST, long column values that do not
change already do not need to be written to the WAL. Thus the potential is
not as large as it might seem.

I would expect high-level transaction redo records to be much more
compact; mixed into the WAL, such records shouldn't make the WAL
grow much faster.

All redo records have to be at the tuple level, so what higher-level
are you talking about ? (statement level redo records would not be
able to reproduce the same resulting table data (keyword: transaction
isolation level))

Statement-level redo records would be nice, but as you note they are
rarely practical if done by the database.

The point is that the database cannot do it unless it only allows
serializable access and allows no user-defined functions with external
or runtime dependencies.

Redo records that contain whole blocks may be much bulkier
than records of whole tuples.

What is written in whole pages is the physical log, and yes those pages can
be stripped before the log is copied to the backup location.

Redo records of whole tuples may be much bulkier than those that just
identify changed fields.

Yes, that might help in some cases, but as I said above, if it actually
makes a substantial difference it would be best already done before the WAL
is written.

Bulky logs mean more-frequent snapshot backups, and bulky log formats
are less suitable for network transmission, and therefore less useful
for replication.

Any reasonably flexible replication that is based on the WAL will need to
preprocess the WAL files (or buffers) before transmission anyway.

Andreas

#37Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Nathan Myers (#34)
Re: Re: Backup and Recovery

On Thu, Jul 05, 2001 at 09:33:17PM -0400, Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

In other words, we keep the page images and row records in one file so
we can do one fsync, but once we have written the page, we don't want to
store them for later point-in-time recovery.

What we'd want to do is strip the page images from the version of the
logs that's archived for recovery purposes.

Am I correct in believing that the remaining row images would have to
be applied to a clean table-image snapshot? Maybe you can produce a
clean table-image snapshot by making a dirty image copy, and then
replaying the WAL from the time you started copying up to the time
when you finish copying.

Good point. You are going to need a tar image of the data files to
restore via WAL and skip all WAL records from before the tar image. WAL
does some of the tricky stuff now as part of crash recovery, but it gets
more complicated for a point-in-time recovery because the binary image
was taken over time, not at a single point in time like crash recovery.

How hard would it be to turn these row records into updates against a
pg_dump image, assuming access to a good table-image file?

pg_dump is very hard because WAL contains only tids. No way to match
that to pg_dump-loaded rows.

Ideally the archiving process would also discard records from aborted
transactions, but I'm not sure how hard that'd be to do.

A second pass over the WAL file -- or the log-picker daemon's
first-pass output -- could eliminate the dead row images. Handling
WAL file boundaries might be tricky if one WAL file has dead row-images
and the next has the abort-or-commit record. Maybe the daemon has to
look ahead into the next WAL file to know what to discard from the
current file.

Would it be useful to mark points in a WAL file where there are no
transactions with outstanding writes?

I think CHECKPOINT is as good as we are going to get in that area, but
of course there are outstanding transactions that are not going to be
picked up because they weren't committed before the checkpoint
completed.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB (#35)
Re: AW: Re: Backup and Recovery

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

Ideally the archiving
process would also discard records from aborted transactions, but I'm
not sure how hard that'd be to do.

Unless we have UNDO we also need to roll forward the physical changes of
aborted transactions, or later redo records will "sit on a wrong physical image".

Wouldn't it be the same as the case where we *do* have UNDO? How is a
removed tuple different from a tuple that was never there?

regards, tom lane

#39Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Tom Lane (#38)
AW: AW: Re: Backup and Recovery

Ideally the archiving
process would also discard records from aborted transactions, but I'm
not sure how hard that'd be to do.

Unless we have UNDO we also need to roll forward the physical changes of
aborted transactions, or later redo records will "sit on a

wrong physical image".

Wouldn't it be the same as the case where we *do* have UNDO? How is a
removed tuple different from a tuple that was never there?

HiHi, the problem is a subtle one. What if a previously aborted txn
produced a btree page split that would otherwise not have happened?
Another issue is "physical log" if the first modification after a checkpoint
was from an aborted txn. Now, because you need to write that physical log
page, you will also need to write the abort to pg_log ...

I guess you can however discard heap tuple *column values* from aborted
txns, but I am not sure that is worth it.

Andreas

#40Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Zeugswetter Andreas SB (#39)
AW: AW: Re: Backup and Recovery

Another issue is "physical log" if first modification after checkpoint
was from an aborted txn. Now because you need to write that physical log
page you will also need to write the abort to pg_log ...

Forget that part, sorry, how stupid. We were getting rid of those pages anyway.

Andreas

#41Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB (#39)
Re: AW: AW: Re: Backup and Recovery

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

Wouldn't it be the same as the case where we *do* have UNDO? How is a
removed tuple different from a tuple that was never there?

HiHi, the problem is a subtle one. What if a previously aborted txn
produced a btree page split that would otherwise not have happened?

Good point. We'd have to recognize btree splits (and possibly some
other operations) as things that must be done anyway, even if their
originating transaction is aborted.

There already is a mechanism for doing that: xlog entries can be written
without any transaction identifier (see XLOG_NO_TRAN). Seems to me that
btree split XLOG records should be getting written that way now --- Vadim,
don't you agree?

regards, tom lane

#42Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Tom Lane (#41)
RE: AW: AW: Re: Backup and Recovery

Good point. We'd have to recognize btree splits (and possibly some
other operations) as things that must be done anyway, even if their
originating transaction is aborted.

There already is a mechanism for doing that: xlog entries can
be written without any transaction identifier (see XLOG_NO_TRAN).
Seems to me that btree split XLOG records should be getting written
that way now --- Vadim, don't you agree?

We would have to write two records per split instead of one as now.
Another way is a new xlog AM method: we have XXX_redo, XXX_undo (non-functional)
and XXX_desc (for debug output) now -- add XXX_compact (or whatever),
able to modify the record somehow for BAR. For heap, etc., this method could
be {return} (or NULL), and for btree it could remove the inserted tuple
from the record (for an aborted TX).

Vadim

#43Nathan Myers
ncm@zembu.com
In reply to: Bruce Momjian (#37)
Re: Re: Backup and Recovery

On Fri, Jul 06, 2001 at 06:52:49AM -0400, Bruce Momjian wrote:

Nathan wrote:

How hard would it be to turn these row records into updates against a
pg_dump image, assuming access to a good table-image file?

pg_dump is very hard because WAL contains only tids. No way to match
that to pg_dump-loaded rows.

Maybe pg_dump can write out a mapping of TIDs to line numbers, and the
back-end can create a map of inserted records' line numbers when the dump
is reloaded, so that the original TIDs can be traced to the new TIDs.
I guess this would require a new option on IMPORT. I suppose the
mappings could be temporary tables.

Nathan Myers
ncm@zembu.com