pg_dump: Sorted output, referential integrity statements

sszabo@megazone23.bigpanda.com

about 24 years ago

In reply to: Christof Petig (#1)

Re: pg_dump: Sorted output, referential integrity statements

On Thu, 6 Dec 2001, Christof Petig wrote:

- pg_dump outputs referential constraints as 3 triggers (near to two
different tables) per constraint. A mode which outputs the original
statement (alter table ... add constraint) would be more sql standard
conformant, portable and readable. But ... you might get into trouble if
the referenced table creation command is output later.

There's some interesting timing things with this. Pretty much the
alter statements have to be after the creates for all the tables at least
due to recursive constraints. When you're using insert statements (-d)
since the restore doesn't appear to be in a transaction, all the data
needs to have been loaded as well (again due to recursive constraints).
In fact, there's *no* guarantee that even with a transaction that a
restore of the current database state statement by statement will succeed
since the user may have done odd things to insert the data.
If the data's already there, the alter table is going to check each row
for validity which can be kinda slow right now on big restores, we'd
probably need to make a better check.

pjw@rhyme.com.au

about 24 years ago

In reply to: Christof Petig (#1)

Re: pg_dump: Sorted output, referential integrity

At 12:05 6/12/01 +0100, Christof Petig wrote:

- pg_dump outputs the data unsorted

Not quite correct; it outputs them in an order that is designed to improve
the chances of dependencies being satisfied, and improve the performance of
a full restore (a modified OID order).

but to manage the data in a version
control system you need it consistently sorted. So a flag to sort by
either primary key or left to right would be of great value. (--sorted
?)

Not really very generalizable when you consider user defined types,
triggers etc.

- pg_dump outputs referential constraints as 3 triggers (near to two
different tables) per constraint. A mode which outputs the original
statement (alter table ... add constraint) would be more sql standard

Abosolutely; with time we are moving pg_dump to use standard SQL.

So, I'm kindly asking for your opinion regarding this two features.
Does anybody plan to implement them?

No plans for the first one, but sorting by ('object-type', 'object-name')
would be close to trivial, if there is much interest/support for it.

The second (SQL conformance) is high on my list; a few people (Chris &
Stephen?) have been working hard to implement 'alter table add/etc
constraint'. When this is stable, we will move pg_dump in that direction.
But as of 7.1, there were still wrinkles in the the implementation that
meant it was unsuitable for pg_dump. Not sure about the status in 7.2.

----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.B.N. 75 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/

christof@petig-baender.de

about 24 years ago

In reply to: Stephan Szabo (#2)

Re: pg_dump: Sorted output, referential integrity statements

Stephan Szabo wrote:

Since nobody answered concerning the sort issue, I guess
- nobody is planning or implementing this
- nobody disagrees this might be handy to have

On Thu, 6 Dec 2001, Christof Petig wrote:

- pg_dump outputs referential constraints as 3 triggers (near to two
different tables) per constraint. A mode which outputs the original
statement (alter table ... add constraint) would be more sql standard
conformant, portable and readable. But ... you might get into trouble if
the referenced table creation command is output later.

There's some interesting timing things with this. Pretty much the
alter statements have to be after the creates for all the tables at least
due to recursive constraints. When you're using insert statements (-d)
since the restore doesn't appear to be in a transaction, all the data
needs to have been loaded as well (again due to recursive constraints).
In fact, there's *no* guarantee that even with a transaction that a
restore of the current database state statement by statement will succeed
since the user may have done odd things to insert the data.
If the data's already there, the alter table is going to check each row
for validity which can be kinda slow right now on big restores, we'd
probably need to make a better check.

The propose was mainly made to make the output more readable if you dump a
single table (per pg_dump call). This would also use portable sql commands so
it's easier to migrate data (given that you also specify -D).

Yours
Christof

christof@petig-baender.de

about 24 years ago

In reply to: Philip Warner (#3)

Re: pg_dump: Sorted output, referential integrity

Philip Warner wrote:

Ah, yes. Now I remember it was you improving pg_dump.

At 12:05 6/12/01 +0100, Christof Petig wrote:

- pg_dump outputs the data unsorted

Not quite correct; it outputs them in an order that is designed to improve
the chances of dependencies being satisfied, and improve the performance of
a full restore (a modified OID order).

That's perfect - unless you want to diff two pg_dumps

but to manage the data in a version
control system you need it consistently sorted. So a flag to sort by
either primary key or left to right would be of great value. (--sorted
?)

Not really very generalizable when you consider user defined types,
triggers etc.

Hmmm. But if we have a primary key on columns (A,B,C) and request the data
'order by A,B,C' this should be portable, shouldn't it?
If we don't have a primary key simply ordering by 1,2,3,...n should also work.
Or am I missing something?

- pg_dump outputs referential constraints as 3 triggers (near to two
different tables) per constraint. A mode which outputs the original
statement (alter table ... add constraint) would be more sql standard

Abosolutely; with time we are moving pg_dump to use standard SQL.

Great news.

So, I'm kindly asking for your opinion regarding this two features.
Does anybody plan to implement them?

No plans for the first one, but sorting by ('object-type', 'object-name')
would be close to trivial, if there is much interest/support for it.

I don't understand what you mean by 'sorting by object-type/name', can you
give me an example. Simply adding an (optional) order by clause was the one I
intended.

The second (SQL conformance) is high on my list; a few people (Chris &
Stephen?) have been working hard to implement 'alter table add/etc
constraint'. When this is stable, we will move pg_dump in that direction.
But as of 7.1, there were still wrinkles in the the implementation that
meant it was unsuitable for pg_dump. Not sure about the status in 7.2.

Oh, I was targeting 7.2. I can not surely tell about 7.2, but have seen cvs
logs implementing similar things.

Christof

Ross J. Reedstrom

reedstrm@rice.edu

about 24 years ago

In reply to: Christof Petig (#5)

Re: pg_dump: Sorted output, referential integrity

On Fri, Dec 07, 2001 at 03:16:26PM +0100, Christof Petig wrote:

Philip Warner wrote:

Ah, yes. Now I remember it was you improving pg_dump.

At 12:05 6/12/01 +0100, Christof Petig wrote:

- pg_dump outputs the data unsorted

Not quite correct; it outputs them in an order that is designed to improve
the chances of dependencies being satisfied, and improve the performance of
a full restore (a modified OID order).

That's perfect - unless you want to diff two pg_dumps

I've ran into this myself. However, I've never wanted to diff a full dump,
usually just schema comparisions - I usually _know_ which database has
the current data, I just want to be sure I can move it over. For schema
comparisions, it's easy enough to generate a 'diffable' file that reflects
the schema, something like:

select relname||'.'||attname from pg_class c, pg_attribute a
where attrelid=c.oid and attnum >0 and relname !~ '^pg' order by
relname,attname;

Hmm, I do see that sometimes it'd be nice to do a full diff, really. The
'oid order' was a nice hack to avoid having to do a full dependency
analysis on db objects, but they're not stable. I think with oids
going away as much as possible, anyway, we're probably going to have
to bite the bullet and do dependencies, one way or another. There are
a number of features that are often requested that all boil down to
knowing dependencies: dropping the auto-generated sequence for a serial,
along with the table - reparsing various functions/views/etc. when the
underlying tables are modified, etc.

Ross

sszabo@megazone23.bigpanda.com

about 24 years ago

In reply to: Philip Warner (#3)

Re: pg_dump: Sorted output, referential integrity

The second (SQL conformance) is high on my list; a few people (Chris &
Stephen?) have been working hard to implement 'alter table add/etc
constraint'. When this is stable, we will move pg_dump in that direction.
But as of 7.1, there were still wrinkles in the the implementation that
meant it was unsuitable for pg_dump. Not sure about the status in 7.2.

Well, the biggest thing I see on using alter table add constraint for
foreign keys is the expense involved if you do it after the tables are
populated. I chose the theoretical cleanliness of checking each row
using the code we had over the speed of doing a special check for the
alter table case, although I'm considering reversing that for 7.3 to make
the alter table more reasonable and make it possible for you to consider
doing it.

pjw@rhyme.com.au

about 24 years ago

In reply to: Christof Petig (#5)

Re: pg_dump: Sorted output, referential integrity

At 15:16 7/12/01 +0100, Christof Petig wrote:

Not really very generalizable when you consider user defined types,
triggers etc.

Hmmm. But if we have a primary key on columns (A,B,C) and request the data
'order by A,B,C' this should be portable, shouldn't it?
If we don't have a primary key simply ordering by 1,2,3,...n should also

work.

Or am I missing something?

My mistake; I thought you wanted to compare metadata. Sorting data by PK
seems like a reasonable thing to do.

pjw@rhyme.com.au

about 24 years ago

In reply to: Philip Warner (#8)

Re: pg_dump: Sorted output, referential integrity

At 10:02 8/12/01 +1100, Philip Warner wrote:

At 15:16 7/12/01 +0100, Christof Petig wrote:

Not really very generalizable when you consider user defined types,
triggers etc.

Hmmm. But if we have a primary key on columns (A,B,C) and request the data
'order by A,B,C' this should be portable, shouldn't it?
If we don't have a primary key simply ordering by 1,2,3,...n should also

work.

Or am I missing something?

My mistake; I thought you wanted to compare metadata. Sorting data by PK
seems like a reasonable thing to do.

To make the dump diff-able, we probably need to sort the metadata by type &
name as well.

#10

Christopher Kings-Lynne

chriskl@familyhealth.com.au

about 24 years ago

In reply to: Christof Petig (#5)

Re: pg_dump: Sorted output, referential integrity

but to manage the data in a version
control system you need it consistently sorted. So a flag to sort by
either primary key or left to right would be of great value. (--sorted
?)

Not really very generalizable when you consider user defined types,
triggers etc.

Hmmm. But if we have a primary key on columns (A,B,C) and request the data
'order by A,B,C' this should be portable, shouldn't it?
If we don't have a primary key simply ordering by 1,2,3,...n
should also work.
Or am I missing something?

I can see how ordering a dump by the primary key would be a neat way of
'clustering' your data after a restore, however I have qualms about the
scalability of such a scheme. What if someone has a 100GB table? They may
have arranged things so that they never get a sort from it or something, or
it might take ages. However I guess if it's an optional parameter it might
be neat.

My feeling is that it won't happen unless you actually code it into a patch
that makes it a parameter to pg_dump. Having an actual patch is a great way
of getting something you want done ;)

Alternatively, have you tried just writing a PERL script (or some clever sed
script) that will just sort the COPY FROM sections...?

Chris

#11

christof@petig-baender.de

about 24 years ago

In reply to: Christopher Kings-Lynne (#10)

Re: pg_dump: Sorted output, referential integrity

Christopher Kings-Lynne wrote:

but to manage the data in a version
control system you need it consistently sorted. So a flag to sort by
either primary key or left to right would be of great value. (--sorted
?)

Not really very generalizable when you consider user defined types,
triggers etc.

Hmmm. But if we have a primary key on columns (A,B,C) and request the data
'order by A,B,C' this should be portable, shouldn't it?
If we don't have a primary key simply ordering by 1,2,3,...n
should also work.
Or am I missing something?

I can see how ordering a dump by the primary key would be a neat way of
'clustering' your data after a restore, however I have qualms about the
scalability of such a scheme. What if someone has a 100GB table? They may
have arranged things so that they never get a sort from it or something, or
it might take ages. However I guess if it's an optional parameter it might
be neat.

My feeling is that it won't happen unless you actually code it into a patch
that makes it a parameter to pg_dump. Having an actual patch is a great way
of getting something you want done ;)

Alternatively, have you tried just writing a PERL script (or some clever sed
script) that will just sort the COPY FROM sections...?

That's beyond my perl skills. And I believe sed to be not the right tool. (hmm,
perhaps split (at 'COPY FROM' and at '\.'), then sort, then cat ... many
(perhaps big) temporary files, let the db do the hard work)

But making a patch to pg_dump is a matter of (say) up to 4 hours.
I'll do it since you seem to like it and nobody started doing it so far.

Christof

#12

Christopher Kings-Lynne

chriskl@familyhealth.com.au

about 24 years ago

In reply to: Christof Petig (#11)

Re: pg_dump: Sorted output, referential integrity

But making a patch to pg_dump is a matter of (say) up to 4 hours.
I'll do it since you seem to like it and nobody started doing it so far.

Well, I'm in no way a major developer, so even if I do like it, I don't know
what the chances are of it making its way into the tree.

Chris

#13

christof@petig-baender.de

about 24 years ago

In reply to: Christopher Kings-Lynne (#12)

Re: pg_dump: Sorted output, referential integrity

Christopher Kings-Lynne wrote:

But making a patch to pg_dump is a matter of (say) up to 4 hours.
I'll do it since you seem to like it and nobody started doing it so far.

Well, I'm in no way a major developer, so even if I do like it, I don't know
what the chances are of it making its way into the tree.

If I stop using C++ comments '//', the chance might grew better ;-) [I
apologize again]

Since Philip also likes it ...
I would say it's a good feature to have and up to now most of my patches went
into the tree. So the chances are not that bad (though definitely not for 7.2).

Christof

#14

pjw@rhyme.com.au

about 24 years ago

In reply to: Stephan Szabo (#7)

Re: pg_dump: Sorted output, referential integrity

At 13:34 7/12/01 -0800, Stephan Szabo wrote:

Well, the biggest thing I see on using alter table add constraint for
foreign keys is the expense involved if you do it after the tables are
populated.

Is it really worse than loading the tables with the constraint in place?

I chose the theoretical cleanliness of checking each row
using the code we had over the speed of doing a special check for the
alter table case,

Out of curiosity - what was the difference?

By the sounds of it, we may get 'alter table' in pg_dump by 7.3 or 7.4.

#15

sszabo@megazone23.bigpanda.com

about 24 years ago

In reply to: Philip Warner (#14)

Re: pg_dump: Sorted output, referential integrity

On Tue, 11 Dec 2001, Philip Warner wrote:

At 13:34 7/12/01 -0800, Stephan Szabo wrote:

Well, the biggest thing I see on using alter table add constraint for
foreign keys is the expense involved if you do it after the tables are
populated.

Is it really worse than loading the tables with the constraint in place?

I'd say its better than while loading, but currently the check isn't
performed at all I think, because the create constraint trigger
statements are after data load and they don't check the data at all.
At least that's how I remember it, I could be wrong.

I chose the theoretical cleanliness of checking each row
using the code we had over the speed of doing a special check for the
alter table case,

Out of curiosity - what was the difference?

The check could be performed in a single statment on the fktable with
a not exists (limit 1). I've sort of hoped that the optimizer would
be able to potentially pick a better plan than run the subselect once
for every row in the fktable. :) But at the time, I wasn't comfortable
with mucking with the triggers themselves, and that would have meant
having two things that each had a copy of the fk check logic.

By the sounds of it, we may get 'alter table' in pg_dump by 7.3 or 7.4.

That'd be cool. :)

#16

christof@petig-baender.de

about 24 years ago

In reply to: Christopher Kings-Lynne (#12)

1 attachment(s)

Re: pg_dump: PATCH for Sorted output

Christof Petig wrote:

Christopher Kings-Lynne wrote:

But making a patch to pg_dump is a matter of (say) up to 4 hours.
I'll do it since you seem to like it and nobody started doing it so far.

Well, I'm in no way a major developer, so even if I do like it, I don't know
what the chances are of it making its way into the tree.

If I stop using C++ comments '//', the chance might grew better ;-) [I
apologize again]

Since Philip also likes it ...
I would say it's a good feature to have.

Here's the patch. It's not as efficient as it might be (if dumpTable_order_by had
indinfo around) but it works. I'm not clear about quoting when using sorted output
in 'COPY' style. So if anybody has good test cases around (tables with strange
characters), please check it.

Also I don't know whether the sorting behaviour is sensible when it comes to
inheritance. Can someone using inheritance please check it.

If you like the patch I'll provide documentation patches.

-----

This patch implements:
-T alias '--sort' which sorts by primary key / the columns in output order

Yours
Christof

#17

Jan Wieck

janwieck@yahoo.com

about 24 years ago

In reply to: Stephan Szabo (#15)

Re: pg_dump: Sorted output, referential integrity

Stephan Szabo wrote:

On Tue, 11 Dec 2001, Philip Warner wrote:

At 13:34 7/12/01 -0800, Stephan Szabo wrote:

Well, the biggest thing I see on using alter table add constraint for
foreign keys is the expense involved if you do it after the tables are
populated.

Is it really worse than loading the tables with the constraint in place?

I'd say its better than while loading, but currently the check isn't
performed at all I think, because the create constraint trigger
statements are after data load and they don't check the data at all.
At least that's how I remember it, I could be wrong.

You're not. This discussion came up a couple of times, and
the answer is allways the same.

We don't want to define the constraints with ALTER TABLE
because this means checking data on restore that doesn't need
to be checked at all (in theory). If he has a crash of a
critical system and restores from a dump, I bet the farm that
he wants it FAST.

Jan

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

_________________________________________________________
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com

#18

sszabo@megazone23.bigpanda.com

about 24 years ago

In reply to: Jan Wieck (#17)

Re: pg_dump: Sorted output, referential integrity

On Tue, 11 Dec 2001, Jan Wieck wrote:

Stephan Szabo wrote:

On Tue, 11 Dec 2001, Philip Warner wrote:

At 13:34 7/12/01 -0800, Stephan Szabo wrote:

Well, the biggest thing I see on using alter table add constraint for
foreign keys is the expense involved if you do it after the tables are
populated.

Is it really worse than loading the tables with the constraint in place?

I'd say its better than while loading, but currently the check isn't
performed at all I think, because the create constraint trigger
statements are after data load and they don't check the data at all.
At least that's how I remember it, I could be wrong.

You're not. This discussion came up a couple of times, and
the answer is allways the same.

We don't want to define the constraints with ALTER TABLE
because this means checking data on restore that doesn't need
to be checked at all (in theory). If he has a crash of a
critical system and restores from a dump, I bet the farm that
he wants it FAST.

I'd say as an optional parameter to dump, it's definately not a bad idea
(like the idea of a --sql or whatever) since the user has to explicitly
ask for it. I think for the rest of the cases it comes down to what people
want it to do.

#19