pg_dump restore time and Foreign Keys
pg_dump restore times can be high when they include many ALTER TABLE ADD
FOREIGN KEY statements, since each statement scans the data to check that
it is fully valid in all cases.
I've been asked "why do we run that at all?", since if we dumped the tables
together, we already know they match.
If we had a way for pg_dump to pass on the information that the check
already passes, we would be able to skip it.
Proposal:
* Introduce a new mode for ALTER TABLE ADD FOREIGN KEY [WITHOUT CHECK];
When we run WITHOUT CHECK, iff both the source and target table are
newly created in this transaction, then we skip the check. If the check
is skipped we mark the constraint as being unchecked, so we can tell
later if this has been used.
* Have pg_dump write the new syntax into its dumps, when both the source
and target table are dumped in the same run
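As a sketch (hypothetical syntax and placeholder table names; nothing like WITHOUT CHECK exists today), a single-transaction restore using the proposal might look like:

```sql
-- Hypothetical dump fragment: both tables are created and loaded in
-- this same transaction, so the restore-time FK re-check is redundant.
BEGIN;

CREATE TABLE parent (id integer PRIMARY KEY);
CREATE TABLE child  (id integer PRIMARY KEY, parent_id integer);

COPY parent (id) FROM stdin;
1
2
\.

COPY child (id, parent_id) FROM stdin;
10	1
11	2
\.

-- Proposed: skip the validating scan; the constraint would be marked
-- "unchecked" in the catalogs so the shortcut remains visible later.
ALTER TABLE child ADD FOREIGN KEY (parent_id)
    REFERENCES parent (id) WITHOUT CHECK;

COMMIT;
```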
I'm guessing that the WITHOUT CHECK option would not be acceptable as an
unprotected trap for our lazy and wicked users. :-)
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
I'm guessing that the WITHOUT CHECK option would not be acceptable as an
unprotected trap for our lazy and wicked users. :-)
Yes, that sounds scary.
Instead, I'd suggest finding ways to speed up the ALTER TABLE ADD
FOREIGN KEY. Or speeding up COPY into a table with foreign keys already
defined. For example, you might want to build an in-memory hash table of
the keys in the target table, instead of issuing a query on each INSERT,
if the target table isn't huge.
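For context, the per-row check that the RI triggers currently issue on each INSERT into the referencing table is roughly of this shape (a sketch; table and column names are placeholders for what the trigger builds internally):

```sql
-- Roughly what the RI trigger runs for every inserted or updated row
-- of the referencing table; $1 is the new foreign-key value.
SELECT 1 FROM ONLY pktable x WHERE x.pkcol = $1 FOR SHARE OF x;
```

The hash-table idea would replace this per-row query with an in-memory lookup.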
Nothing beats the speed of simply not checking the constraint, of
course, but I'd hate to lose the protection it gives.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, 2008-06-05 at 10:19 +0300, Heikki Linnakangas wrote:
Simon Riggs wrote:
I'm guessing that the WITHOUT CHECK option would not be acceptable as an
unprotected trap for our lazy and wicked users. :-)
Yes, that sounds scary.
Instead, I'd suggest finding ways to speed up the ALTER TABLE ADD
FOREIGN KEY.
I did make a suggestion for improving it, but for integers only; if
anybody has any other ideas, I'm all ears.
Or speeding up COPY into a table with foreign keys already
defined. For example, you might want to build an in-memory hash table of
the keys in the target table, instead of issuing a query on each INSERT,
if the target table isn't huge.
No, that's not the problem, but I agree that is a problem also.
Nothing beats the speed of simply not checking the constraint, of
course, but I'd hate to lose the protection it gives.
Are you saying you don't like the rest of the proposal, or just don't
like the idea of having that added as an unprotected option, but find
the proposal acceptable?
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
If we had a way for pg_dump to pass on the information that the check
already passes, we would be able to skip it.
Proposal:
* Introduce a new mode for ALTER TABLE ADD FOREIGN KEY [WITHOUT CHECK];
* Have pg_dump write the new syntax into its dumps, when both the source
and target table are dumped in the same run
I've been known to manually tweak dumps before now. I can see me
forgetting this.
What about pg_dump writing out a row-count and MD5 of the rows in the
COPY (just a textual calculation)? Iff the restore checksum matches the
dump checksum for both tables then the foreign-key checks can be skipped.
If the restore checksum doesn't match the dump then it can issue a
warning, but continue and run the full fkey check.
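A sketch of what such an annotated dump might look like (hypothetical comment format and placeholder values; pg_dump emits nothing of the sort today):

```sql
COPY parent (id) FROM stdin;
1
2
\.
-- pg_dump: rows=2 md5=<checksum of the COPY text above>

-- pg_restore would recompute rows/md5 over the data it just loaded for
-- both tables; if both match the dump, skip the validating scan,
-- otherwise warn and run the full foreign-key check as usual.
ALTER TABLE child ADD FOREIGN KEY (parent_id) REFERENCES parent (id);
```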
--
Richard Huxton
Archonet Ltd
Simon Riggs wrote:
pg_dump restore times can be high when they include many ALTER TABLE ADD
FOREIGN KEY statements, since each statement scans the data to check that
it is fully valid in all cases.
I've been asked "why do we run that at all?", since if we dumped the tables
together, we already know they match.
If we had a way for pg_dump to pass on the information that the check
already passes, we would be able to skip it.
Proposal:
* Introduce a new mode for ALTER TABLE ADD FOREIGN KEY [WITHOUT CHECK];
When we run WITHOUT CHECK, iff both the source and target table are
newly created in this transaction, then we skip the check. If the check
is skipped we mark the constraint as being unchecked, so we can tell
later if this has been used.
* Have pg_dump write the new syntax into its dumps, when both the source
and target table are dumped in the same run
I'm guessing that the WITHOUT CHECK option would not be acceptable as an
unprotected trap for our lazy and wicked users. :-)
This whole proposal would be a major footgun which would definitely be
abused, IMNSHO.
I think Heikki's idea of speeding up the check using a hash table of the
foreign keys possibly has merit.
cheers
andrew
Simon Riggs wrote:
Are you saying you don't like the rest of the proposal, or just don't
like the idea of having that added as an unprotected option, but find
the proposal acceptable?
I don't like the idea of having an unprotected option. If we were going
to have one, I wouldn't bother with the extra checks you proposed; it's
going to be unsafe anyway.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, 2008-06-05 at 07:57 -0400, Andrew Dunstan wrote:
Simon Riggs wrote:
pg_dump restore times can be high when they include many ALTER TABLE ADD
FOREIGN KEY statements, since each statement scans the data to check that
it is fully valid in all cases.
I've been asked "why do we run that at all?", since if we dumped the tables
together, we already know they match.
If we had a way for pg_dump to pass on the information that the check
already passes, we would be able to skip it.
Proposal:
* Introduce a new mode for ALTER TABLE ADD FOREIGN KEY [WITHOUT CHECK];
When we run WITHOUT CHECK, iff both the source and target table are
newly created in this transaction, then we skip the check. If the check
is skipped we mark the constraint as being unchecked, so we can tell
later if this has been used.
* Have pg_dump write the new syntax into its dumps, when both the source
and target table are dumped in the same run
I'm guessing that the WITHOUT CHECK option would not be acceptable as an
unprotected trap for our lazy and wicked users. :-)
This whole proposal would be a major footgun which would definitely be
abused, IMNSHO.
OK, understood. Two negatives is enough to sink it.
I think Heikki's idea of speeding up the check using a hash table of the
foreign keys possibly has merit.
The query is sent through SPI, so if there were a way to speed this up,
we would already be using it implicitly. If we find a way to speed up
joins, it will improve the FK check also.
The typical join plan for the check query is already a hash join,
assuming the target table is small enough. If not, it's a huge sort/merge
join. So in a way, we already follow the suggestion.
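The check query in question is essentially an outer join looking for orphaned rows, something like the following (a sketch with placeholder table and column names; the real query is built internally by ALTER TABLE):

```sql
-- A validation scan of this general shape is planned like any other
-- join: typically a hash join when the referenced side fits in memory,
-- otherwise a sort/merge join. Any row returned violates the FK.
SELECT fk.parent_id
FROM ONLY child fk
LEFT OUTER JOIN ONLY parent pk ON fk.parent_id = pk.id
WHERE pk.id IS NULL AND fk.parent_id IS NOT NULL;
```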
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
On Thu, 2008-06-05 at 10:19 +0300, Heikki Linnakangas wrote:
Simon Riggs wrote:
I'm guessing that the WITHOUT CHECK option would not be acceptable as an
unprotected trap for our lazy and wicked users. :-)
Yes, that sounds scary.
Instead, I'd suggest finding ways to speed up the ALTER TABLE ADD
FOREIGN KEY.
I did make a suggestion for improving it, but for integers only; if
anybody has any other ideas, I'm all ears.
Well, one idea would be to allow adding multiple foreign keys in one
command, and checking them all at once with one SQL query instead of one
per foreign key. Right now we need one seq scan over the table per
foreign key; by checking all references at once we would need only one
seq scan to check them all.
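Such a combined check could be a single query of roughly this shape (a sketch with made-up table names), scanning the referencing table once for both constraints:

```sql
-- One pass over orders checks both foreign keys; a missing parent
-- shows up as a NULL on the corresponding outer-join side.
SELECT o.customer_id, o.product_id
FROM ONLY orders o
LEFT JOIN ONLY customers c ON o.customer_id = c.id
LEFT JOIN ONLY products  p ON o.product_id  = p.id
WHERE (o.customer_id IS NOT NULL AND c.id IS NULL)
   OR (o.product_id  IS NOT NULL AND p.id IS NULL);
```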
Or speeding up COPY into a table with foreign keys already
defined. For example, you might want to build an in-memory hash table of
the keys in the target table, instead of issuing a query on each INSERT,
if the target table isn't huge.
No, that's not the problem, but I agree that is a problem also.
It is related, because if we can make COPY into a table with foreign
keys fast enough, we could rearrange dumps so that foreign keys are
created before loading data. That would save the seqscan over the table
altogether.
Thinking about this idea a bit more, instead of loading the whole target
table into memory, it would probably make more sense to keep a hash
table as just a cache of the most recent keys that have been referenced.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, 2008-06-05 at 16:01 +0300, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Thu, 2008-06-05 at 10:19 +0300, Heikki Linnakangas wrote:
Simon Riggs wrote:
I'm guessing that the WITHOUT CHECK option would not be acceptable as an
unprotected trap for our lazy and wicked users. :-)
Yes, that sounds scary.
Instead, I'd suggest finding ways to speed up the ALTER TABLE ADD
FOREIGN KEY.
I did make a suggestion for improving it, but for integers only; if
anybody has any other ideas, I'm all ears.
Well, one idea would be to allow adding multiple foreign keys in one
command, and checking them all at once with one SQL query instead of one
per foreign key. Right now we need one seq scan over the table per
foreign key; by checking all references at once we would need only one
seq scan to check them all.
No need. Just parallelise the restore with concurrent psql. Which would
speed up the index creation also. Does Greg have plans for further work?
Or speeding up COPY into a table with foreign keys already
defined. For example, you might want to build an in-memory hash table of
the keys in the target table, instead of issuing a query on each INSERT,
if the target table isn't huge.
No, that's not the problem, but I agree that is a problem also.
It is related, because if we can make COPY into a table with foreign
keys fast enough, we could rearrange dumps so that foreign keys are
created before loading data. That would save the seqscan over the table
altogether.
True.
Thinking about this idea a bit more, instead of loading the whole target
table into memory, it would probably make more sense to keep a hash
table as just a cache of the most recent keys that have been referenced.
If you can think of a way of improving hash joins generally, then it
will work for this specific case also.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
On Thu, 2008-06-05 at 16:01 +0300, Heikki Linnakangas wrote:
Well, one idea would be to allow adding multiple foreign keys in one
command, and checking them all at once with one SQL query instead of one
per foreign key. Right now we need one seq scan over the table per
foreign key; by checking all references at once we would need only one
seq scan to check them all.
No need. Just parallelise the restore with concurrent psql. Which would
speed up the index creation also.
True, you could do that.
Does Greg have plans for further work?
I believe he's busy with other stuff at the moment.
Thinking about this idea a bit more, instead of loading the whole target
table into memory, it would probably make more sense to keep a hash
table as just a cache of the most recent keys that have been referenced.
If you can think of a way of improving hash joins generally, then it
will work for this specific case also.
Individual RI checks performed on inserts/COPY don't do a hash join. The
bulk check done by ALTER TABLE ADD FOREIGN KEY does, but that's a
different issue.
This hash table would be a specific trick to speed up RI checks. If
you're I/O bound anyway, it wouldn't help, and you'd already be better
off creating the foreign key first and loading the data after that.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thursday 05 June 2008 08:56:35 Simon Riggs wrote:
On Thu, 2008-06-05 at 07:57 -0400, Andrew Dunstan wrote:
Simon Riggs wrote:
pg_dump restore times can be high when they include many ALTER TABLE
ADD FOREIGN KEY statements, since each statement scans the data to
check that it is fully valid in all cases.
I've been asked "why do we run that at all?", since if we dumped the
tables together, we already know they match.
If we had a way for pg_dump to pass on the information that the check
already passes, we would be able to skip it.
Proposal:
* Introduce a new mode for ALTER TABLE ADD FOREIGN KEY [WITHOUT CHECK];
When we run WITHOUT CHECK, iff both the source and target table are
newly created in this transaction, then we skip the check. If the check
is skipped we mark the constraint as being unchecked, so we can tell
later if this has been used.
* Have pg_dump write the new syntax into its dumps, when both the
source and target table are dumped in the same run
I'm guessing that the WITHOUT CHECK option would not be acceptable as
an unprotected trap for our lazy and wicked users. :-)
This whole proposal would be a major footgun which would definitely be
abused, IMNSHO.
OK, understood. Two negatives is enough to sink it.
Heh, I would have argued that the idea should go the other way and just make
this part of the normal syntax. Oracle DBAs have been doing this for years
(MS SQL supports it too, actually) and it really helps in working around
having to hold locks on large relations for lengthy periods of time. Heck,
I'd like to see a no-check option for all constraints, really.
--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Sat, 2008-06-07 at 13:08 -0400, Robert Treat wrote:
On Thursday 05 June 2008 08:56:35 Simon Riggs wrote:
On Thu, 2008-06-05 at 07:57 -0400, Andrew Dunstan wrote:
Heh, I would have argued that the idea should go the other way and
just make this part of the normal syntax. Oracle DBA's have been
doing this for years (MS SQL supports it too actually) and it really
helps working around having to hold locks on large relations for
lengthy periods of times. Heck, I'd like to see a no check option for
all constraints really.
Interesting that SQL Server does it also.
Holding the lock for a long period is just one more problem. :-)
I'm always torn between the I-know-what-Im-doing-so-give-me-the-option
viewpoint and the some-dumbass-will-abuse-it viewpoint. I see the
results of both viewpoints daily.
Perhaps we need a GUC that says expert_mode = on. In expert_mode we are
allowed to do a range of things that are normally avoided - there would
be an explicit list. Managers can then take a single considered decision
as to whether the situation warrants extreme action and their DBA is
good enough to handle it. That might resolve our continued angst about
whether our users are smart enough to avoid the gotchas, or just smart
enough to win a DBA's Darwin Award.
The UNIX philosophy has always been to allow the power to exist, yet
seek to minimise the number of people who exercise it. Another idea
might be to make such command options superuser only, to ensure the
power is available, yet only in the hands of, by-definition, the trusted
few.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
On Sat, 2008-06-07 at 13:08 -0400, Robert Treat wrote:
On Thursday 05 June 2008 08:56:35 Simon Riggs wrote:
On Thu, 2008-06-05 at 07:57 -0400, Andrew Dunstan wrote:
Heh, I would have argued that the idea should go the other way and
just make this part of the normal syntax. Oracle DBAs have been
doing this for years (MS SQL supports it too, actually) and it really
helps in working around having to hold locks on large relations for
lengthy periods of time. Heck, I'd like to see a no-check option for
all constraints, really.
Interesting that SQL Server does it also.
Holding the lock for a long period is just one more problem. :-)
I'm always torn between the I-know-what-Im-doing-so-give-me-the-option
viewpoint and the some-dumbass-will-abuse-it viewpoint. I see the
results of both viewpoints daily.
Perhaps we need a GUC that says expert_mode = on. In expert_mode we are
allowed to do a range of things that are normally avoided - there would
be an explicit list. Managers can then take a single considered decision
as to whether the situation warrants extreme action and their DBA is
good enough to handle it. That might resolve our continued angst about
whether our users are smart enough to avoid the gotchas, or just smart
enough to win a DBA's Darwin Award.
The UNIX philosophy has always been to allow the power to exist, yet
seek to minimise the number of people who exercise it. Another idea
might be to make such command options superuser only, to ensure the
power is available, yet only in the hands of, by-definition, the trusted
few.
If we go down this road then I would far rather we tried to devise some
safe (or semi-safe) way of doing it instead of simply providing expert
(a.k.a. footgun) mode.
For instance, I'm wondering if we could do something with checksums of
the input lines or something else that would make this difficult to do
in circumstances other than pg_restore.
cheers
andrew
Simon Riggs <simon@2ndquadrant.com> writes:
Perhaps we need a GUC that says expert_mode = on. ... Another idea
might be to make such command options superuser only, to ensure the
power is available, yet only in the hands of, by-definition, the trusted
few.
This all seems pretty useless, as the sort of user most likely to shoot
himself in the foot will also always be running as superuser.
I'd much rather see us expend more effort on speeding up the checks
than open holes in the system.
regards, tom lane
On Saturday 07 June 2008 16:22:56 Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
Perhaps we need a GUC that says expert_mode = on. ... Another idea
might be to make such command options superuser only, to ensure the
power is available, yet only in the hands of, by-definition, the trusted
few.
This all seems pretty useless, as the sort of user most likely to shoot
himself in the foot will also always be running as superuser.
yeah, i'm not a big fan of "set enable_footgun=true" since the people likely
to get tripped up are going to blindly enable these modes.
otoh, if we do such a thing, i would be a big fan of calling
it "enable_footgun" :-)
I'd much rather see us expend more effort on speeding up the checks
than open holes in the system.
and i'm sure no one is against that idea, but you're never going to be able to
match the performance of just avoiding the check.
--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
Robert Treat <xzilla@users.sourceforge.net> writes:
and i'm sure no one is against that idea, but you're never going to be able to
match the performance of just avoiding the check.
We'll never be able to match the performance of not having transactions,
either, but the community has never for a moment considered having a
"no transactions" mode.
regards, tom lane
On Sunday 08 June 2008 20:12:15 Tom Lane wrote:
Robert Treat <xzilla@users.sourceforge.net> writes:
and i'm sure no one is against that idea, but you're never going to be
able to match the performance of just avoiding the check.
We'll never be able to match the performance of not having transactions,
either, but the community has never for a moment considered having a
"no transactions" mode.
it's unclear what a "no transactions" mode would mean, but I'd be willing to
guess some people have considered aspects of it (we've just never had
agreement)
--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Jun 7, 2008, at 2:00 PM, Andrew Dunstan wrote:
If we go down this road then I would far rather we tried to devise
some safe (or semi-safe) way of doing it instead of simply
providing expert (a.k.a. footgun) mode.
For instance, I'm wondering if we could do something with checksums
of the input lines or something else that would make this difficult
to do in circumstances other than pg_restore.
Yes, but that provides no help at all outside of pg_dump. Being able
to add a FK with NO CHECK would be tremendously useful outside of
pg_dump. Actually, in the interest of stating the problem and not the
solution, what we need is a way to add FKs that doesn't lock
everything up to perform the key checks. Perhaps there is some
semi-safe way that the constraint could be added and the checks done
in the background...
As for the footgun aspect, are we the enterprise-class OSS database
or the one that caters itself to noobs that will go out of their way
to make life hard on themselves? I'm all in favor of not adding
footguns that don't have value, but this one holds a lot of value for
anyone trying to maintain a large database in a 24/7 environment. To
put this in perspective, the amount of revenue we would lose from
adding just one FK to one of our larger tables would more than cover
paying someone to develop this feature.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
Decibel! wrote:
Yes, but that provides no help at all outside of pg_dump. Being able
to add a FK with NO CHECK would be tremendously useful outside of
pg_dump. Actually, in the interest of stating the problem and not the
solution, what we need is a way to add FKs that doesn't lock
everything up to perform the key checks. Perhaps there is some
semi-safe way that the constraint could be added and the checks done
in the background...
I had some thoughts along the same lines.
But how do you propose to recover when the check fails? What should
pg_restore do if the dump is corrupt, causing an FK check to fail?
I suppose we could have some sort of marking for FK constraints along
the lines of {checked, unchecked, invalid}.
As for the footgun aspect, are we the enterprise-class OSS database or
the one that caters itself to noobs that will go out of their way to
make life hard on themselves?
We are the database that tries very hard to keep its promises. If you
want to change or relax those promises then the implications need to be
very very clear.
I'm all in favor of not adding footguns that don't have value, but
this one holds a lot of value for anyone trying to maintain a large
database in a 24/7 environment. To put this in perspective, the amount
of revenue we would lose from adding just one FK to one of our larger
tables would more than cover paying someone to develop this feature.
Come up with a good proposal and I'm your man :-) I haven't seen one yet.
cheers
andrew
Decibel! <decibel@decibel.org> writes:
Actually, in the interest of stating the problem and not the
solution, what we need is a way to add FKs that doesn't lock
everything up to perform the key checks.
Ah, finally a useful comment. I think it might be possible to do an
"add FK concurrently" type of command that would take exclusive lock
for just long enough to add the triggers, then scan the tables with just
AccessShareLock to see if the existing rows meet the constraint, and
if so finally mark the constraint "valid". Meanwhile the constraint
would be enforced against newly-added rows by the triggers, so nothing
gets missed. You'd still get a small hiccup in system performance
from the transient exclusive lock, but nothing like as bad as it is
now. Would that solve your problem?
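In outline, the two-phase approach might look like this (hypothetical syntax and placeholder names; nothing like this exists today):

```sql
-- Phase 1: take the exclusive lock only long enough to create the
-- constraint and its enforcement triggers; existing rows are not
-- checked yet, and the constraint is flagged as not validated.
ALTER TABLE child ADD CONSTRAINT child_parent_fk
    FOREIGN KEY (parent_id) REFERENCES parent (id) NOT VALID;

-- New and updated rows are enforced by the triggers from here on.

-- Phase 2: scan the existing rows holding only AccessShareLock; on
-- success, mark the constraint valid in the catalogs.
ALTER TABLE child VALIDATE CONSTRAINT child_parent_fk;
```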
regards, tom lane