Further pg_upgrade analysis for many tables
As a followup to Magnus's report that pg_upgrade was slow for many
tables, I did some more testing with many tables, e.g.:
CREATE TABLE test991 (x SERIAL);
I ran it for 0, 1k, 2k, ... 16k tables, and got these results:
tables pg_dump restore pg_upgrade(increase)
0 0.30 0.24 11.73(-)
1000 6.46 6.55 28.79(2.45x)
2000 29.82 20.96 69.75(2.42x)
4000 95.70 115.88 289.82(4.16x)
8000 405.38 505.93 1168.60(4.03x)
16000 1702.23 2197.56 5022.82(4.30x)
Things look fine through 2k, but at 4k the duration of pg_dump, restore,
and pg_upgrade (which is mostly a combination of these two) is 4x,
rather than the 2x as predicted by the growth in the number of tables.
To see how bad it is, 16k tables is 1.3 hours, and 32k tables would be
5.6 hours by my estimates.
You can see the majority of pg_upgrade duration is made up of the
pg_dump and the schema restore, so I can't really speed up pg_upgrade
without speeding those up, and the 4x increase is in _both_ of those
operations, not just one.
Also, for 16k, I had to increase max_locks_per_transaction or the dump
would fail, which kind of surprised me.
I tested 9.2 and git head, but they produced identical numbers. I did
use synchronous_commit=off.
Any ideas? I am attaching my test script.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Wed, Nov 7, 2012 at 09:17:29PM -0500, Bruce Momjian wrote:
Things look fine through 2k, but at 4k the duration of pg_dump, restore,
and pg_upgrade (which is mostly a combination of these two) is 4x,
rather than the 2x as predicted by the growth in the number of tables.
To see how bad it is, 16k tables is 1.3 hours, and 32k tables would be
5.6 hours by my estimates.
You can see the majority of pg_upgrade duration is made up of the
pg_dump and the schema restore, so I can't really speed up pg_upgrade
without speeding those up, and the 4x increase is in _both_ of those
operations, not just one.
Also, for 16k, I had to increase max_locks_per_transaction or the dump
would fail, which kind of surprised me.
I tested 9.2 and git head, but they produced identical numbers. I did
use synchronous_commit=off.
Any ideas? I am attaching my test script.
Thinking this might be related to some server setting, I increased
shared buffers, work_mem, and maintenance_work_mem, but this produced
almost no improvement:
tables pg_dump restore pg_upgrade
1 0.30 0.24 11.73(-)
1000 6.46 6.55 28.79(2.45)
2000 29.82 20.96 69.75(2.42)
4000 95.70 115.88 289.82(4.16)
8000 405.38 505.93 1168.60(4.03)
shared_buffers=1GB
tables pg_dump restore pg_upgrade
1 0.26 0.23
1000 6.22 7.00
2000 23.92 22.51
4000 88.44 111.99
8000 376.20 531.07
shared_buffers=1GB
work_mem/maintenance_work_mem = 500MB
1 0.27 0.23
1000 6.39 8.27
2000 26.34 20.53
4000 89.47 104.59
8000 397.13 486.99
Any ideas what else I should test? Is this O(2n) or O(n^2) behavior?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On 11/7/12 9:17 PM, Bruce Momjian wrote:
As a followup to Magnus's report that pg_upgrade was slow for many
tables, I did some more testing with many tables, e.g.:
CREATE TABLE test991 (x SERIAL);
I ran it for 0, 1k, 2k, ... 16k tables, and got these results:
tables pg_dump restore pg_upgrade(increase)
0 0.30 0.24 11.73(-)
1000 6.46 6.55 28.79(2.45x)
2000 29.82 20.96 69.75(2.42x)
4000 95.70 115.88 289.82(4.16x)
8000 405.38 505.93 1168.60(4.03x)
16000 1702.23 2197.56 5022.82(4.30x)
I can reproduce these numbers, more or less. (Additionally, it ran out
of shared memory with the default setting when dumping the 8000 tables.)
But this issue seems to be entirely the fault of sequences being
present. When I replace the serial column with an int, everything
finishes within seconds and scales seemingly linearly.
On Wed, Nov 7, 2012 at 6:17 PM, Bruce Momjian <bruce@momjian.us> wrote:
As a followup to Magnus's report that pg_upgrade was slow for many
tables, I did some more testing with many tables, e.g.:
...
Any ideas? I am attaching my test script.
Have you reviewed the thread at:
http://archives.postgresql.org/pgsql-performance/2012-09/msg00003.php
?
There is a known N^2 behavior when using pg_dump against pre-9.3 servers.
There was a proposed patch to pg_dump to work around the problem when
it is used against older servers, but it was not accepted and not
entered into a commitfest. For one thing, there were doubts
about how stable it would be at very large scale and it wasn't tested
all that thoroughly, and for another, it would be a temporary
improvement, as once the server itself is upgraded to 9.3, the kludge
in pg_dump would no longer be an improvement.
The most recent version (that I can find) of that work-around patch is at:
http://archives.postgresql.org/pgsql-performance/2012-06/msg00071.php
I don't know if that will solve your particular case, but it is
probably worth a try.
Cheers,
Jeff
On Thu, Nov 8, 2012 at 03:46:09PM -0800, Jeff Janes wrote:
On Wed, Nov 7, 2012 at 6:17 PM, Bruce Momjian <bruce@momjian.us> wrote:
As a followup to Magnus's report that pg_upgrade was slow for many
tables, I did some more testing with many tables, e.g.:...
Any ideas? I am attaching my test script.
Have you reviewed the thread at:
http://archives.postgresql.org/pgsql-performance/2012-09/msg00003.php
?
There is a known N^2 behavior when using pg_dump against pre-9.3 servers.
I am actually now dumping git head/9.3, so I assume all the problems we
know about should be fixed.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Thu, Nov 8, 2012 at 12:30:11PM -0500, Peter Eisentraut wrote:
On 11/7/12 9:17 PM, Bruce Momjian wrote:
As a followup to Magnus's report that pg_upgrade was slow for many
tables, I did some more testing with many tables, e.g.:
CREATE TABLE test991 (x SERIAL);
I ran it for 0, 1k, 2k, ... 16k tables, and got these results:
tables pg_dump restore pg_upgrade(increase)
0 0.30 0.24 11.73(-)
1000 6.46 6.55 28.79(2.45x)
2000 29.82 20.96 69.75(2.42x)
4000 95.70 115.88 289.82(4.16x)
8000 405.38 505.93 1168.60(4.03x)
16000 1702.23 2197.56 5022.82(4.30x)
I can reproduce these numbers, more or less. (Additionally, it ran out
of shared memory with the default setting when dumping the 8000 tables.)
But this issue seems to be entirely the fault of sequences being
present. When I replace the serial column with an int, everything
finishes within seconds and scales seemingly linearly.
I did some more research and realized that I was not using --schema-only
like pg_upgrade uses. With that setting, things look like this:
--schema-only
tables pg_dump restore pg_upgrade
1 0.27 0.23 11.73(-)
1000 3.64 5.18 28.79(2.45)
2000 13.07 14.63 69.75(2.42)
4000 43.93 66.87 289.82(4.16)
8000 190.63 326.67 1168.60(4.03)
16000 757.80 1402.82 5022.82(4.30)
You can still see the 4x increase, but it is now for all tests ---
basically, every time the number of tables doubles, the time to dump or
restore a _single_ table doubles, e.g. for 1k tables, a single table
takes 0.00364 seconds to dump; for 16k tables, a single table takes
0.04736 seconds to dump, a 13x slowdown.
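A quick sanity check of that arithmetic (illustrative Python over the --schema-only dump times quoted above, not part of the test script): the per-table cost roughly doubles each time the table count doubles, which is the signature of O(n^2) total time.

```python
# --schema-only pg_dump times from the table above (seconds).
dump = {1000: 3.64, 2000: 13.07, 4000: 43.93, 8000: 190.63, 16000: 757.80}

# Per-table cost: 0.00364s at 1k tables vs ~0.04736s at 16k, ~13x slower.
per_table = {n: t / n for n, t in dump.items()}
assert round(per_table[1000], 5) == 0.00364
assert round(per_table[16000], 5) == 0.04736
assert round(per_table[16000] / per_table[1000], 1) == 13.0

# Each doubling of the table count roughly doubles the per-table cost,
# i.e. total time grows roughly quadratically.
sizes = sorted(dump)
for small, big in zip(sizes, sizes[1:]):
    ratio = per_table[big] / per_table[small]
    assert 1.6 < ratio < 2.3, ratio
```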
Second, with --schema-only, you can see the dump/restore is only 50% of
the duration of pg_upgrade, and you can also see that pg_upgrade itself
is slowing down as the number of tables increases, even ignoring the
dump/reload time.
This is all bad news. :-( I will keep digging.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Thu, Nov 8, 2012 at 4:33 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Thu, Nov 8, 2012 at 03:46:09PM -0800, Jeff Janes wrote:
On Wed, Nov 7, 2012 at 6:17 PM, Bruce Momjian <bruce@momjian.us> wrote:
As a followup to Magnus's report that pg_upgrade was slow for many
tables, I did some more testing with many tables, e.g.:...
Any ideas? I am attaching my test script.
Have you reviewed the thread at:
http://archives.postgresql.org/pgsql-performance/2012-09/msg00003.php
?
There is a known N^2 behavior when using pg_dump against pre-9.3 servers.
I am actually now dumping git head/9.3, so I assume all the problems we
know about should be fixed.
Are you sure the server you are dumping out of is head?
Using head's pg_dump, but 9.2.1 server, it takes me 179.11 seconds to
dump 16,000 tables (schema only) like your example, and it is
definitely quadratic.
But using head's pg_dump to dump tables out of head's server, it only
took 24.95 seconds, and the quadratic term is not yet important,
things still look linear.
But even the 179.11 seconds is several times faster than your report
of 757.8, so I'm not sure what is going on there. I don't think my
laptop is particularly fast:
Intel(R) Pentium(R) CPU B960 @ 2.20GHz
Is the next value, increment, etc. for a sequence stored in a catalog,
or are they stored in the 8kb file associated with each sequence? If
they are stored in the file, then it is a shame that pg_dump goes to the
effort of extracting that info if pg_upgrade is just going to
overwrite it anyway.
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
Are sure the server you are dumping out of is head?
I experimented a bit with dumping/restoring 16000 tables matching
Bruce's test case (ie, one serial column apiece). The pg_dump profile
seems fairly flat, without any easy optimization targets. But
restoring the dump script shows a rather interesting backend profile:
samples % image name symbol name
30861 39.6289 postgres AtEOXact_RelationCache
9911 12.7268 postgres hash_seq_search
2682 3.4440 postgres init_sequence
2218 2.8482 postgres _bt_compare
2120 2.7223 postgres hash_search_with_hash_value
1976 2.5374 postgres XLogInsert
1429 1.8350 postgres CatalogCacheIdInvalidate
1282 1.6462 postgres LWLockAcquire
973 1.2494 postgres LWLockRelease
702 0.9014 postgres hash_any
The hash_seq_search time is probably mostly associated with
AtEOXact_RelationCache, which is run during transaction commit and scans
the relcache hashtable looking for tables created in the current
transaction. So that's about 50% of the runtime going into that one
activity.
There are at least three ways we could whack that mole:
* Run the psql script in --single-transaction mode, as I was mumbling
about the other day. If we were doing AtEOXact_RelationCache only once,
rather than once per CREATE TABLE statement, it wouldn't be a problem.
Easy but has only a narrow scope of applicability.
* Keep a separate list (or data structure of your choice) so that
relcache entries created in the current xact could be found directly
rather than having to scan the whole relcache. That'd add complexity
though, and could perhaps be a net loss for cases where the relcache
isn't so bloated.
* Limit the size of the relcache (eg by aging out
not-recently-referenced entries) so that we aren't incurring O(N^2)
costs for scripts touching N tables. Again, this adds complexity and
could be counterproductive in some scenarios.
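A toy model of the cost being discussed (illustrative Python, not PostgreSQL code): with one CREATE TABLE per transaction, each commit scans the whole relcache, so total work over N tables is 1+2+...+N, while either a single transaction or a per-xact list of created entries is linear.

```python
# Toy cost model of AtEOXact_RelationCache at commit; not PostgreSQL code.

def per_statement_commit_cost(n_tables):
    """One CREATE TABLE per transaction: each commit scans the whole
    relcache, which has grown by one entry per statement."""
    cost, cache_size = 0, 0
    for _ in range(n_tables):
        cache_size += 1      # new relcache entry for the created table
        cost += cache_size   # commit scans every entry so far
    return cost

def tracked_entries_cost(n_tables):
    """Commit visits only entries created in the current transaction
    (or one big transaction commits once): linear in n_tables."""
    return n_tables

n = 16000
assert per_statement_commit_cost(n) == n * (n + 1) // 2  # O(N^2)
assert tracked_entries_cost(n) == n                      # O(N)
```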
regards, tom lane
On Fri, Nov 9, 2012 at 6:59 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Thu, Nov 8, 2012 at 4:33 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Thu, Nov 8, 2012 at 03:46:09PM -0800, Jeff Janes wrote:
On Wed, Nov 7, 2012 at 6:17 PM, Bruce Momjian <bruce@momjian.us> wrote:
As a followup to Magnus's report that pg_upgrade was slow for many
tables, I did some more testing with many tables, e.g.:...
Any ideas? I am attaching my test script.
Have you reviewed the thread at:
http://archives.postgresql.org/pgsql-performance/2012-09/msg00003.php
?
There is a known N^2 behavior when using pg_dump against pre-9.3 servers.
I am actually now dumping git head/9.3, so I assume all the problems we
know about should be fixed.
Are you sure the server you are dumping out of is head?
Using head's pg_dump, but 9.2.1 server, it takes me 179.11 seconds to
dump 16,000 tables (schema only) like your example, and it is
definitely quadratic.
But using head's pg_dump to dump tables out of head's server, it only
took 24.95 seconds, and the quadratic term is not yet important,
things still look linear.
I also ran a couple of experiments with git head. From 8k to 16k I'm
seeing slightly super-linear scaling (2.25x), from 32k to 64k a
quadratic term has taken over (3.74x).
I ran the experiments on a slightly beefier machine (Intel i5 @ 4GHz,
Intel SSD 320, Linux 3.2, ext4). For 16k, pg_dump took 29s, pg_upgrade
111s. At 64k the times were 150s/1237s. I didn't measure it, but
occasional peek at top suggested that most of the time was spent doing
server side processing of restore.
I also took two profiles (attached). AtEOXact_RelationCache seems to
be the culprit for the quadratic growth.
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On Fri, Nov 9, 2012 at 7:53 AM, Ants Aasma <ants@cybertec.at> wrote:
I also took two profiles (attached). AtEOXact_RelationCache seems to
be the culprit for the quadratic growth.
One more thing that jumps out as quadratic from the profiles is
transfer_all_new_dbs from pg_upgrade (20% of total CPU time at 64k).
Searching for non-primary files loops over the whole file list for
each relation. This would be a lot faster if we would sort the file
list first and use binary search to find the related files.
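A sketch of that suggestion (illustrative Python; the file names and helper are hypothetical, modeled on a relfilenode's main fork plus "_fsm"/"_vm" forks and ".N" segments, not pg_upgrade's actual code): sort the directory listing once, then binary-search for each relation's files instead of scanning the whole list per relation.

```python
# Sort the file list once, then find each relation's files by binary
# search rather than a full scan per relation.
from bisect import bisect_left

def files_for_relation(sorted_files, relfilenode):
    """Return all files belonging to one relfilenode via binary search."""
    prefix = str(relfilenode)
    out = []
    i = bisect_left(sorted_files, prefix)  # first possible match
    while i < len(sorted_files) and sorted_files[i].startswith(prefix):
        rest = sorted_files[i][len(prefix):]
        # accept "", ".1", "_fsm", "_vm"; reject e.g. "163840"
        if rest == "" or rest[0] in "._":
            out.append(sorted_files[i])
        i += 1
    return out

files = sorted(["16384", "16384.1", "16384_fsm", "16384_vm",
                "163840", "16385", "2619", "2619_vm"])
assert files_for_relation(files, 16384) == ["16384", "16384.1",
                                            "16384_fsm", "16384_vm"]
assert files_for_relation(files, 2619) == ["2619", "2619_vm"]
```

This turns the O(relations x files) scan into one O(files log files) sort plus an O(log files) lookup per relation.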
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
Are sure the server you are dumping out of is head?
I experimented a bit with dumping/restoring 16000 tables matching
Bruce's test case (ie, one serial column apiece). The pg_dump profile
seems fairly flat, without any easy optimization targets. But
restoring the dump script shows a rather interesting backend profile:
samples % image name symbol name
30861 39.6289 postgres AtEOXact_RelationCache
9911 12.7268 postgres hash_seq_search
...
There are at least three ways we could whack that mole:
* Run the psql script in --single-transaction mode, as I was mumbling
about the other day. If we were doing AtEOXact_RelationCache only once,
rather than once per CREATE TABLE statement, it wouldn't be a problem.
Easy but has only a narrow scope of applicability.
That is effective when loading into 9.3 (assuming you make
max_locks_per_transaction large enough). But when loading into <9.3,
using --single-transaction will evoke the quadratic behavior in the
resource owner/lock table and make things worse rather than better.
But there is still the question of how people can start using 9.3 if
they can't use pg_upgrade, or use the pg_dump half of the dump/restore,
in order to get there.
It seems to me that pg_upgrade takes some pains to ensure that no one
else attaches to the database during its operation. In that case, is
it necessary to run the entire dump in a single transaction in order
to get a consistent picture? The attached crude patch allows pg_dump
to not use a single transaction (and thus not accumulate a huge number
of locks) by using the --pg_upgrade flag.
This seems to remove the quadratic behavior of running pg_dump against
pre-9.3 servers. It is linear up to 30,000 tables with a single
serial column, at about 1.5 msec per table.
I have no evidence other than a gut feeling that this is a safe thing to do.
I've also tested Tatsuo-san's group-"LOCK TABLE" patch against this
case, and it is minimal help. The problem is that there is no syntax
for locking sequences, so they cannot be explicitly locked as a group
but rather are implicitly locked one by one and so still suffer from
the quadratic behavior.
Cheers,
Jeff
Attachments:
pg_dump_for_upgrade.patch (application/octet-stream)
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
new file mode 100644
index 82330cb..70e2fd4
*** a/src/bin/pg_dump/pg_dump.c
--- b/src/bin/pg_dump/pg_dump.c
*************** extern int optind,
*** 67,72 ****
--- 67,74 ----
opterr;
+ int pg_upgrade=0;
+
typedef struct
{
const char *descr; /* comment for an object */
*************** main(int argc, char **argv)
*** 345,350 ****
--- 347,353 ----
{"inserts", no_argument, &dump_inserts, 1},
{"lock-wait-timeout", required_argument, NULL, 2},
{"no-tablespaces", no_argument, &outputNoTablespaces, 1},
+ {"pg_upgrade", no_argument, &pg_upgrade, 1},
{"quote-all-identifiers", no_argument, &quote_all_identifiers, 1},
{"role", required_argument, NULL, 3},
{"section", required_argument, NULL, 5},
*************** main(int argc, char **argv)
*** 608,613 ****
--- 611,617 ----
/*
* Start transaction-snapshot mode transaction to dump consistent data.
*/
+ if (!pg_upgrade) {
ExecuteSqlStatement(fout, "BEGIN");
if (fout->remoteVersion >= 90100)
{
*************** main(int argc, char **argv)
*** 623,628 ****
--- 627,633 ----
else
ExecuteSqlStatement(fout,
"SET TRANSACTION ISOLATION LEVEL SERIALIZABLE");
+ };
/* Select the appropriate subquery to convert user IDs to names */
if (fout->remoteVersion >= 80100)
*************** getTables(Archive *fout, int *numTables)
*** 4349,4354 ****
--- 4354,4360 ----
* NOTE: it'd be kinda nice to lock other relations too, not only
* plain tables, but the backend doesn't presently allow that.
*/
+ if (!pg_upgrade) {
if (tblinfo[i].dobj.dump && tblinfo[i].relkind == RELKIND_RELATION)
{
resetPQExpBuffer(query);
*************** getTables(Archive *fout, int *numTables)
*** 4359,4364 ****
--- 4365,4371 ----
tblinfo[i].dobj.name));
ExecuteSqlStatement(fout, query->data);
}
+ };
/* Emit notice if join for owner failed */
if (strlen(tblinfo[i].rolname) == 0)
On Thu, Nov 8, 2012 at 08:59:21PM -0800, Jeff Janes wrote:
On Thu, Nov 8, 2012 at 4:33 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Thu, Nov 8, 2012 at 03:46:09PM -0800, Jeff Janes wrote:
On Wed, Nov 7, 2012 at 6:17 PM, Bruce Momjian <bruce@momjian.us> wrote:
As a followup to Magnus's report that pg_upgrade was slow for many
tables, I did some more testing with many tables, e.g.:...
Any ideas? I am attaching my test script.
Have you reviewed the thread at:
http://archives.postgresql.org/pgsql-performance/2012-09/msg00003.php
?
There is a known N^2 behavior when using pg_dump against pre-9.3 servers.
I am actually now dumping git head/9.3, so I assume all the problems we
know about should be fixed.
Are you sure the server you are dumping out of is head?
Well, I tested again with 9.2 dumping/loading 9.2 and the same for git
head, and got these results:
pg_dump restore
9.2 git 9.2 git
1 0.13 0.11 0.07 0.07
1000 4.37 3.98 4.32 5.28
2000 12.98 12.19 13.64 14.25
4000 47.85 50.14 61.31 70.97
8000 210.39 183.00 302.67 294.20
16000 901.53 769.83 1399.25 1359.09
As you can see, there is very little difference between 9.2 and git
head, except maybe at the 16k level for pg_dump.
Is there some slowdown with a mismatched version dump/reload? I am
attaching my test script.
Using head's pg_dump, but 9.2.1 server, it takes me 179.11 seconds to
dump 16,000 tables (schema only) like your example, and it is
definitely quadratic.
Are you using a SERIAL column for the tables? I am, and Peter
Eisentraut reported that was a big slowdown.
But using head's pg_dump to dump tables out of head's server, it only
took 24.95 seconds, and the quadratic term is not yet important,
things still look linear.
Again, using SERIAL?
But even the 179.11 seconds is several times faster than your report
of 757.8, so I'm not sure what is going on there. I don't think my
laptop is particularly fast:
Intel(R) Pentium(R) CPU B960 @ 2.20GHz
I am using server-grade hardware, Xeon E5620 2.4GHz:
http://momjian.us/main/blogs/pgblog/2012.html#January_20_2012
Is the next value, increment, etc. for a sequence stored in a catalog,
or are they stored in the 8kb file associated with each sequence? If
Each sequence is stored in its own 1-row 8k table:
test=> CREATE SEQUENCE seq;
CREATE SEQUENCE
test=> SELECT * FROM seq;
-[ RECORD 1 ]-+--------------------
sequence_name | seq
last_value | 1
start_value | 1
increment_by | 1
max_value | 9223372036854775807
min_value | 1
cache_value | 1
log_cnt | 0
is_cycled | f
is_called | f
they are stored in the file, then it is a shame that pg_dump goes to the
effort of extracting that info if pg_upgrade is just going to
overwrite it anyway.
Actually, pg_upgrade needs pg_dump to restore all those sequence values.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On 2012-11-08 12:30:11 -0500, Peter Eisentraut wrote:
On 11/7/12 9:17 PM, Bruce Momjian wrote:
As a followup to Magnus's report that pg_upgrade was slow for many
tables, I did some more testing with many tables, e.g.:
CREATE TABLE test991 (x SERIAL);
I ran it for 0, 1k, 2k, ... 16k tables, and got these results:
tables pg_dump restore pg_upgrade(increase)
0 0.30 0.24 11.73(-)
1000 6.46 6.55 28.79(2.45x)
2000 29.82 20.96 69.75(2.42x)
4000 95.70 115.88 289.82(4.16x)
8000 405.38 505.93 1168.60(4.03x)
16000 1702.23 2197.56 5022.82(4.30x)
I can reproduce these numbers, more or less. (Additionally, it ran out
of shared memory with the default setting when dumping the 8000 tables.)
But this issue seems to be entirely the fault of sequences being
present. When I replace the serial column with an int, everything
finishes within seconds and scales seemingly linearly.
I don't know the pg_dump code at all but I would guess that without the
serial there are no dependencies, so the whole dependency sorting
business doesn't need to do very much...
Greetings,
Andres Freund
On Thu, Nov 8, 2012 at 7:25 PM, Bruce Momjian <bruce@momjian.us> wrote:
I did some more research and realized that I was not using --schema-only
like pg_upgrade uses. With that setting, things look like this:
...
For profiling pg_dump in isolation, you should also specify
--binary-upgrade. I was surprised that it makes a big difference,
slowing it down by about 2 fold.
Cheers,
Jeff
On Fri, Nov 9, 2012 at 3:06 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Thu, Nov 8, 2012 at 08:59:21PM -0800, Jeff Janes wrote:
On Thu, Nov 8, 2012 at 4:33 PM, Bruce Momjian <bruce@momjian.us> wrote:
I am actually now dumping git head/9.3, so I assume all the problems we
know about should be fixed.
Are you sure the server you are dumping out of is head?
Well, I tested again with 9.2 dumping/loading 9.2 and the same for git
head, and got these results:
pg_dump restore
9.2 git 9.2 git
1 0.13 0.11 0.07 0.07
1000 4.37 3.98 4.32 5.28
2000 12.98 12.19 13.64 14.25
4000 47.85 50.14 61.31 70.97
8000 210.39 183.00 302.67 294.20
16000 901.53 769.83 1399.25 1359.09
For pg_dump, there are 4 possible combinations, not just two. You can
use 9.2's pg_dump to dump from a 9.2 server, use git's pg_dump to dump
from a 9.2 server, use git's pg_dump to dump from a git server, or use
9.2's pg_dump to dump from a git server (although that last one isn't
very relevant).
As you can see, there is very little difference between 9.2 and git
head, except maybe at the 16k level for pg_dump.
Is there some slowdown with a mismatched version dump/reload? I am
attaching my test script.
Sorry, from the script I can't really tell what versions are being
used for what.
Using head's pg_dump, but 9.2.1 server, it takes me 179.11 seconds to
dump 16,000 tables (schema only) like your example, and it is
definitely quadratic.
Are you using a SERIAL column for the tables? I am, and Peter
Eisentraut reported that was a big slowdown.
Yes, I'm using the same table definition as your example.
But using head's pg_dump to dump tables out of head's server, it only
took 24.95 seconds, and the quadratic term is not yet important,
things still look linear.
Again, using SERIAL?
Yep.
Is the next value, increment, etc. for a sequence stored in a catalog,
or are they stored in the 8kb file associated with each sequence? If
Each sequence is stored in its own 1-row 8k table:
test=> CREATE SEQUENCE seq;
CREATE SEQUENCE
test=> SELECT * FROM seq;
-[ RECORD 1 ]-+--------------------
sequence_name | seq
last_value | 1
start_value | 1
increment_by | 1
max_value | 9223372036854775807
min_value | 1
cache_value | 1
log_cnt | 0
is_cycled | f
is_called | f
they are stored in the file, then it is a shame that pg_dump goes to the
effort of extracting that info if pg_upgrade is just going to
overwrite it anyway.
Actually, pg_upgrade needs pg_dump to restore all those sequence values.
I did an experiment where I had pg_dump just output dummy values
rather than hitting the database. Once pg_upgrade moves the relation
files over, the dummy values disappear and are set back to their
originals. So I think that pg_upgrade depends on pg_dump only in a
trivial way--they need to be there, but it doesn't matter what they
are.
Cheers,
Jeff
On Fri, Nov 9, 2012 at 08:20:59AM +0200, Ants Aasma wrote:
On Fri, Nov 9, 2012 at 7:53 AM, Ants Aasma <ants@cybertec.at> wrote:
I also took two profiles (attached). AtEOXact_RelationCache seems to
be the culprit for the quadratic growth.
One more thing that jumps out as quadratic from the profiles is
transfer_all_new_dbs from pg_upgrade (20% of total CPU time at 64k).
Searching for non-primary files loops over the whole file list for
each relation. This would be a lot faster if we would sort the file
list first and use binary search to find the related files.
I am confused why you see a loop. transfer_all_new_dbs() does a
merge-join of old/new database names, then calls gen_db_file_maps(),
which loops over the relations and calls create_rel_filename_map(),
which adds to the map via array indexing. I don't see any file loops
in there --- can you be more specific?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Fri, Nov 9, 2012 at 04:23:40PM -0800, Jeff Janes wrote:
On Fri, Nov 9, 2012 at 3:06 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Thu, Nov 8, 2012 at 08:59:21PM -0800, Jeff Janes wrote:
On Thu, Nov 8, 2012 at 4:33 PM, Bruce Momjian <bruce@momjian.us> wrote:
I am actually now dumping git head/9.3, so I assume all the problems we
know about should be fixed.
Are you sure the server you are dumping out of is head?
Well, I tested again with 9.2 dumping/loading 9.2 and the same for git
head, and got these results:
pg_dump restore
9.2 git 9.2 git
1 0.13 0.11 0.07 0.07
1000 4.37 3.98 4.32 5.28
2000 12.98 12.19 13.64 14.25
4000 47.85 50.14 61.31 70.97
8000 210.39 183.00 302.67 294.20
16000 901.53 769.83 1399.25 1359.09
For pg_dump, there are 4 possible combinations, not just two. You can
use 9.2's pg_dump to dump from a 9.2 server, use git's pg_dump to dump
from a 9.2 server, use git's pg_dump to dump from a git server, or use
9.2's pg_dump to dump from a git server (although that last one isn't
very relevant).
True, but I thought doing matching versions was a sufficient test.
Using head's pg_dump, but 9.2.1 server, it takes me 179.11 seconds to
dump 16,000 tables (schema only) like your example, and it is
definitely quadratic.
Are you using a SERIAL column for the tables? I am, and Peter
Eisentraut reported that was a big slowdown.
Yes, I'm using the same table definition as your example.
OK.
But using head's pg_dump to dump tables out of head's server, it only
took 24.95 seconds, and the quadratic term is not yet important,
things still look linear.
Again, using SERIAL?
Yep.
Odd that yours is so much faster.
they are stored in the file, then it is a shame that pg_dump goes to the
effort of extracting that info if pg_upgrade is just going to
overwrite it anyway.
Actually, pg_upgrade needs pg_dump to restore all those sequence values.
I did an experiment where I had pg_dump just output dummy values
rather than hitting the database. Once pg_upgrade moves the relation
files over, the dummy values disappear and are set back to their
originals. So I think that pg_upgrade depends on pg_dump only in a
trivial way--they need to be there, but it doesn't matter what they
are.
Oh, wow, I had not thought of that. Once we move the sequence files
into place from the old cluster, whatever was assigned to the sequence
counter by pg_dump restored is thrown away. Good point.
I am hesitant to add an optimization to pg_dump to fix this unless we
decide that pg_dump uses sequences in some non-optimal way that would
not warrant us improving general sequence creation performance.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Sat, Nov 10, 2012 at 7:10 PM, Bruce Momjian <bruce@momjian.us> wrote:
I am confused why you see a loop. transfer_all_new_dbs() does a
merge-join of old/new database names, then calls gen_db_file_maps(),
which loops over the relations and calls create_rel_filename_map(),
which adds to the map via array indexing. I don't see any file loops
in there --- can you be more specific?
Sorry, I was too tired when posting that. I actually meant
transfer_single_new_db(). More specifically the profile clearly showed
that most of the time was spent in the two loops starting on lines 193
and 228.
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On Fri, Nov 9, 2012 at 04:23:40PM -0800, Jeff Janes wrote:
Actually, pg_upgrade needs pg_dump to restore all those sequence values.
I did an experiment where I had pg_dump just output dummy values
rather than hitting the database. Once pg_upgrade moves the relation
files over, the dummy values disappear and are set back to their
originals. So I think that pg_upgrade depends on pg_dump only in a
trivial way--they need to be there, but it doesn't matter what they
are.
FYI, thanks everyone for testing this. I will keep going on my tests
--- seems I have even more things to try in my benchmarks. I will
publish my results soon.
In general, I think we are getting some complaints about dump/restore
performance with a large number of tables, regardless of pg_upgrade,
so it seems worthwhile to try to find the cause.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Sat, Nov 10, 2012 at 07:17:34PM +0200, Ants Aasma wrote:
On Sat, Nov 10, 2012 at 7:10 PM, Bruce Momjian <bruce@momjian.us> wrote:
I am confused why you see a loop. transfer_all_new_dbs() does a
merge-join of old/new database names, then calls gen_db_file_maps(),
which loops over the relations and calls create_rel_filename_map(),
which adds to the map via array indexing. I don't see any file loops
in there --- can you be more specific?
Sorry, I was too tired when posting that. I actually meant
transfer_single_new_db(). More specifically the profile clearly showed
that most of the time was spent in the two loops starting on lines 193
and 228.
Wow, you are right on target. I was so focused on making logical
lookups linear that I did not consider file system vm/fsm and file
extension lookups. Let me think a little and I will report back.
Thanks.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Fri, Nov 9, 2012 at 04:06:38PM -0800, Jeff Janes wrote:
On Thu, Nov 8, 2012 at 7:25 PM, Bruce Momjian <bruce@momjian.us> wrote:
I did some more research and realized that I was not using --schema-only
like pg_upgrade uses. With that setting, things look like this:...
For profiling pg_dump in isolation, you should also specify
--binary-upgrade. I was surprised that it makes a big difference,
slowing it down by about 2 fold.
Yes, I see that now:
pg_dump vs. pg_dump --binary-upgrade
9.2 w/ b-u git w/ b-u pg_upgrade
1 0.13 0.13 0.11 0.13 11.73
1000 4.37 8.18 3.98 8.08 28.79
2000 12.98 33.29 12.19 28.11 69.75
4000 47.85 140.62 50.14 138.02 289.82
8000 210.39 604.95 183.00 517.35 1168.60
16000 901.53 2373.79 769.83 1975.94 5022.82
I didn't show the restore numbers yet because I haven't gotten automated
pg_dump --binary-upgrade restore to work yet, but a normal restore for
16k takes 2197.56, so adding that to 1975.94, you get 4173.5, which is
83% of 5022.82. That is a big chunk of the total time for pg_upgrade.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Sat, Nov 10, 2012 at 9:15 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Fri, Nov 9, 2012 at 04:23:40PM -0800, Jeff Janes wrote:
On Fri, Nov 9, 2012 at 3:06 PM, Bruce Momjian <bruce@momjian.us> wrote:
Again, using SERIAL?
Yep.
Odd why yours is so much faster.
You didn't build git head under --enable-cassert, did you?
Any chance you can do an oprofile or gprof of head's pg_dump dumping
out of head's server? That really should be a lot faster (since
commit eeb6f37d89fc60c6449ca12ef9e) than dumping out of 9.2 server.
If it is not for you, I don't see how to figure it out without a
profile of the slow system.
Cheers,
Jeff
On Sat, Nov 10, 2012 at 05:20:55PM -0500, Bruce Momjian wrote:
On Fri, Nov 9, 2012 at 04:06:38PM -0800, Jeff Janes wrote:
On Thu, Nov 8, 2012 at 7:25 PM, Bruce Momjian <bruce@momjian.us> wrote:
I did some more research and realized that I was not using --schema-only
like pg_upgrade uses. With that setting, things look like this:...
For profiling pg_dump in isolation, you should also specify
--binary-upgrade. I was surprised that it makes a big difference,
slowing it down by about 2 fold.
Yes, I see that now:
pg_dump vs. pg_dump --binary-upgrade
9.2 w/ b-u git w/ b-u pg_upgrade
1 0.13 0.13 0.11 0.13 11.73
1000 4.37 8.18 3.98 8.08 28.79
2000 12.98 33.29 12.19 28.11 69.75
4000 47.85 140.62 50.14 138.02 289.82
8000 210.39 604.95 183.00 517.35 1168.60
16000 901.53 2373.79 769.83 1975.94 5022.82
I didn't show the restore numbers yet because I haven't gotten automated
pg_dump --binary-upgrade restore to work yet, but a normal restore for
16k takes 2197.56, so adding that to 1975.94, you get 4173.5, which is
83% of 5022.82. That is a big chunk of the total time for pg_upgrade.
What I am seeing here is the same 4x increase for a 2x increase in the
number of tables. Something must be going on there. I have oprofile
set up, so I will try to run oprofile and try to find which functions
are taking up most of the time, though I am confused why Tom didn't see
any obvious causes. I will keep going, and will focus on git head, and
schema-only, non-binary-upgrade mode, for simplicity. I am just not
seeing 9.2 or --binary-upgrade causing any fundamental effects ---
pg_dump --schema-only itself has the same problems, and probably the
same cause.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Sat, Nov 10, 2012 at 02:45:54PM -0800, Jeff Janes wrote:
On Sat, Nov 10, 2012 at 9:15 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Fri, Nov 9, 2012 at 04:23:40PM -0800, Jeff Janes wrote:
On Fri, Nov 9, 2012 at 3:06 PM, Bruce Momjian <bruce@momjian.us> wrote:
Again, using SERIAL?
Yep.
Odd why yours is so much faster.
You didn't build git head under --enable-cassert, did you?
Yikes, you got me! I have not done performance testing in so long, I
had forgotten I changed my defaults. New numbers to follow. Sorry.
Any chance you can do an oprofile or gprof of head's pg_dump dumping
out of head's server? That really should be a lot faster (since
commit eeb6f37d89fc60c6449ca12ef9e) than dumping out of 9.2 server.
If it is not for you, I don't see how to figure it out without a
profile of the slow system.
Yes, coming.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Sat, Nov 10, 2012 at 05:59:54PM -0500, Bruce Momjian wrote:
On Sat, Nov 10, 2012 at 02:45:54PM -0800, Jeff Janes wrote:
On Sat, Nov 10, 2012 at 9:15 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Fri, Nov 9, 2012 at 04:23:40PM -0800, Jeff Janes wrote:
On Fri, Nov 9, 2012 at 3:06 PM, Bruce Momjian <bruce@momjian.us> wrote:
Again, using SERIAL?
Yep.
Odd why yours is so much faster.
You didn't build git head under --enable-cassert, did you?
Yikes, you got me! I have not done performance testing in so long, I
had forgotten I changed my defaults. New numbers to follow. Sorry.
Any chance you can do an oprofile or gprof of head's pg_dump dumping
out of head's server? That really should be a lot faster (since
commit eeb6f37d89fc60c6449ca12ef9e) than dumping out of 9.2 server.
If it is not for you, I don't see how to figure it out without a
profile of the slow system.
Yes, coming.
OK, here are my results. Again, apologies for posting non-linear
results based on assert builds:
---------- 9.2 ---------- ------------ 9.3 --------
-- normal -- -- bin-up -- -- normal -- -- bin-up --
dump rest dump rest dump rest dump rest pg_upgrade
1 0.12 0.06 0.12 0.06 0.11 0.07 0.11 0.07 11.11
1000 7.22 2.40 4.74 2.78 2.20 2.43 4.04 2.86 19.60
2000 5.67 5.10 8.82 5.57 4.50 4.97 8.07 5.69 30.55
4000 13.34 11.13 25.16 12.52 8.95 11.24 16.75 12.16 60.70
8000 29.12 25.98 59.60 28.08 16.68 24.02 30.63 27.08 123.05
16000 87.36 53.16 189.38 62.72 31.38 55.37 61.55 62.66 365.71
You can see the non-linear dump at 16k in 9.2, and the almost-linear in
9.3. :-)
pg_upgrade shows non-linear, but that is probably because of the
non-linear behavior of 9.2, and because of the two non-linear loops that
Ants found, that I will address in a separate email.
Thanks for the feedback.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Fri, Nov 9, 2012 at 12:50:34AM -0500, Tom Lane wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
Are sure the server you are dumping out of is head?
I experimented a bit with dumping/restoring 16000 tables matching
Bruce's test case (ie, one serial column apiece). The pg_dump profile
seems fairly flat, without any easy optimization targets. But
restoring the dump script shows a rather interesting backend profile:
samples % image name symbol name
30861 39.6289 postgres AtEOXact_RelationCache
9911 12.7268 postgres hash_seq_search
2682 3.4440 postgres init_sequence
2218 2.8482 postgres _bt_compare
2120 2.7223 postgres hash_search_with_hash_value
1976 2.5374 postgres XLogInsert
1429 1.8350 postgres CatalogCacheIdInvalidate
1282 1.6462 postgres LWLockAcquire
973 1.2494 postgres LWLockRelease
702 0.9014 postgres hash_any
The hash_seq_search time is probably mostly associated with
AtEOXact_RelationCache, which is run during transaction commit and scans
the relcache hashtable looking for tables created in the current
transaction. So that's about 50% of the runtime going into that one
activity.
Thanks for finding this. What is odd is that I am not seeing non-linear
restores at 16k in git head, so I am confused how something that
consumes ~50% of backend time could still perform linearly. Would this
consume 50% at lower table counts?
I agree we should do something, even if this is a rare case, because 50%
is a large percentage.
There are at least three ways we could whack that mole:
* Run the psql script in --single-transaction mode, as I was mumbling
about the other day. If we were doing AtEOXact_RelationCache only once,
rather than once per CREATE TABLE statement, it wouldn't be a problem.
Easy but has only a narrow scope of applicability.
* Keep a separate list (or data structure of your choice) so that
relcache entries created in the current xact could be found directly
rather than having to scan the whole relcache. That'd add complexity
though, and could perhaps be a net loss for cases where the relcache
isn't so bloated.
I like this one. Could we do it only when the cache gets to be above a
certain size, to avoid any penalty?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Bruce Momjian <bruce@momjian.us> writes:
On Fri, Nov 9, 2012 at 12:50:34AM -0500, Tom Lane wrote:
The hash_seq_search time is probably mostly associated with
AtEOXact_RelationCache, which is run during transaction commit and scans
the relcache hashtable looking for tables created in the current
transaction. So that's about 50% of the runtime going into that one
activity.
Thanks for finding this. What is odd is that I am not seeing non-linear
restores at 16k in git head, so I am confused how something that
consumes ~50% of backend time could still perform linearly. Would this
consume 50% at lower table counts?
No, the cost from that is O(N^2), though with a pretty small multiplier.
16K tables is evidently where the cost reaches the point of being
significant --- if you went up from there, you'd probably start to
notice an overall O(N^2) behavior.
regards, tom lane
On Fri, Nov 9, 2012 at 12:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
Are sure the server you are dumping out of is head?
I experimented a bit with dumping/restoring 16000 tables matching
Bruce's test case (ie, one serial column apiece). The pg_dump profile
seems fairly flat, without any easy optimization targets. But
restoring the dump script shows a rather interesting backend profile:
samples % image name symbol name
30861 39.6289 postgres AtEOXact_RelationCache
9911 12.7268 postgres hash_seq_search
2682 3.4440 postgres init_sequence
2218 2.8482 postgres _bt_compare
2120 2.7223 postgres hash_search_with_hash_value
1976 2.5374 postgres XLogInsert
1429 1.8350 postgres CatalogCacheIdInvalidate
1282 1.6462 postgres LWLockAcquire
973 1.2494 postgres LWLockRelease
702 0.9014 postgres hash_any
The hash_seq_search time is probably mostly associated with
AtEOXact_RelationCache, which is run during transaction commit and scans
the relcache hashtable looking for tables created in the current
transaction. So that's about 50% of the runtime going into that one
activity.
There are at least three ways we could whack that mole:
* Run the psql script in --single-transaction mode, as I was mumbling
about the other day. If we were doing AtEOXact_RelationCache only once,
rather than once per CREATE TABLE statement, it wouldn't be a problem.
Easy but has only a narrow scope of applicability.
* Keep a separate list (or data structure of your choice) so that
relcache entries created in the current xact could be found directly
rather than having to scan the whole relcache. That'd add complexity
though, and could perhaps be a net loss for cases where the relcache
isn't so bloated.
* Limit the size of the relcache (eg by aging out
not-recently-referenced entries) so that we aren't incurring O(N^2)
costs for scripts touching N tables. Again, this adds complexity and
could be counterproductive in some scenarios.
Although there may be some workloads that access very large numbers of
tables repeatedly, I bet that's not typical. Rather, I bet that a
session which accesses 10,000 tables is most likely to access them
just once each - and right now we don't handle that case very well;
this is not the first complaint about big relcaches causing problems.
On the flip side, we don't want workloads that exceed some baked-in
cache size to fall off a cliff. So I think we should be looking for a
solution that doesn't put a hard limit on the size of the relcache,
but does provide at least some latitude to get rid of old entries.
So maybe something like this. Add a flag to each relcache entry
indicating whether or not it has been used. After adding 1024 entries
to the relcache, scan all the entries: clear the flag if it's set,
flush the entry if it's already clear. This allows the size of the
relcache to grow without bound, but only if we're continuing to access
the old tables in between adding new ones to the mix. As an
additional safeguard, we could count the number of toplevel SQL
commands that have been executed and require that a flush not be
performed more often than, say, every 64 toplevel SQL commands. That
way, if a single operation on an inheritance parent with many children
sucks a lot of stuff into the relcache, we'll avoid cleaning it out
too quickly.
Maybe this is all too ad-hoc, but I feel like we don't need to
overengineer this. The existing system is fine in 99% of the cases,
so we really only need to find a way to detect the really egregious
case where we are doing a neverending series of one-time table
accesses and apply a very light tap to avoid the pain point in that
case.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Nov 10, 2012 at 12:41:44PM -0500, Bruce Momjian wrote:
On Sat, Nov 10, 2012 at 07:17:34PM +0200, Ants Aasma wrote:
On Sat, Nov 10, 2012 at 7:10 PM, Bruce Momjian <bruce@momjian.us> wrote:
I am confused why you see a loop. transfer_all_new_dbs() does a
merge-join of old/new database names, then calls gen_db_file_maps(),
which loops over the relations and calls create_rel_filename_map(),
which adds to the map via array indexing. I don't see any file loops
in there --- can you be more specific?
Sorry, I was too tired when posting that. I actually meant
transfer_single_new_db(). More specifically the profile clearly showed
that most of the time was spent in the two loops starting on lines 193
and 228.
Wow, you are right on target. I was so focused on making logical
lookups linear that I did not consider file system vm/fsm and file
extension lookups. Let me think a little and I will report back.
Thanks.
OK, I have had some time to think about this. What the current code
does is, for each database, get a directory listing to know about any
vm, fsm, and >1gig extents that exist in the directory. It caches the
directory listing and does full array scans looking for matches. If the
tablespace changes, it creates a new directory cache and throws away the
old one. This code certainly needs improvement!
I can think of two solutions. The first would be to scan the database
directory, and any tablespaces used by the database, sort it, then allow
binary search of the directory listing looking for file prefixes that
match the current relation.
The second approach would be to simply try to copy the fsm, vm, and
extent files, and ignore any ENOENT errors. This allows code
simplification. The downside is that it doesn't pull all files with
matching prefixes --- it requires pg_upgrade to _know_ what suffixes
might exist in that directory. Second, it assumes there can be no
number gaps in the file extent numbering (is that safe?).
I need recommendations on which direction to pursue; this would only be
for 9.3.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Mon, Nov 12, 2012 at 12:09:08PM -0500, Bruce Momjian wrote:
OK, I have had some time to think about this. What the current code
does is, for each database, get a directory listing to know about any
vm, fsm, and >1gig extents that exist in the directory. It caches the
directory listing and does full array scans looking for matches. If the
tablespace changes, it creates a new directory cache and throws away the
old one. This code certainly needs improvement!
I can think of two solutions. The first would be to scan the database
directory, and any tablespaces used by the database, sort it, then allow
binary search of the directory listing looking for file prefixes that
match the current relation.
The second approach would be to simply try to copy the fsm, vm, and
extent files, and ignore any ENOENT errors. This allows code
simplification. The downside is that it doesn't pull all files with
matching prefixes --- it requires pg_upgrade to _know_ what suffixes
might exist in that directory. Second, it assumes there can be no
number gaps in the file extent numbering (is that safe?).
I need recommendations on which direction to pursue; this would only be
for 9.3.
I went with the second idea, patch attached. Here are the times:
---------- 9.2 ---------- ------------ 9.3 --------
-- normal -- -- bin-up -- -- normal -- -- bin-up -- pg_upgrade
dump rest dump rest dump rest dump rest git patch
1 0.12 0.06 0.12 0.06 0.11 0.07 0.11 0.07 11.11 11.02
1000 7.22 2.40 4.74 2.78 2.20 2.43 4.04 2.86 19.60 19.25
2000 5.67 5.10 8.82 5.57 4.50 4.97 8.07 5.69 30.55 26.67
4000 13.34 11.13 25.16 12.52 8.95 11.24 16.75 12.16 60.70 52.31
8000 29.12 25.98 59.60 28.08 16.68 24.02 30.63 27.08 123.05 102.78
16000 87.36 53.16 189.38 62.72 31.38 55.37 61.55 62.66 365.71 286.00
You can see a significant speedup with those loops removed. The 16k
case is improved, but still not linear. The 16k dump/restore scale
looks fine, so it must be something in pg_upgrade, or in the kernel.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Attachments:
pg_upgrade.diff (text/x-diff; charset=us-ascii)
diff --git a/contrib/pg_upgrade/relfilenode.c b/contrib/pg_upgrade/relfilenode.c
new file mode 100644
index 33a867f..0a49a23
*** a/contrib/pg_upgrade/relfilenode.c
--- b/contrib/pg_upgrade/relfilenode.c
***************
*** 17,25 ****
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size);
! static void transfer_relfile(pageCnvCtx *pageConverter,
! const char *fromfile, const char *tofile,
! const char *nspname, const char *relname);
/*
--- 17,24 ----
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size);
! static int transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
! const char *suffix);
/*
*************** static void
*** 131,185 ****
transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size)
{
- char old_dir[MAXPGPATH];
- char file_pattern[MAXPGPATH];
- char **namelist = NULL;
- int numFiles = 0;
int mapnum;
! int fileno;
! bool vm_crashsafe_change = false;
!
! old_dir[0] = '\0';
!
/* Do not copy non-crashsafe vm files for binaries that assume crashsafety */
if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_CRASHSAFE_CAT_VER &&
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
! vm_crashsafe_change = true;
for (mapnum = 0; mapnum < size; mapnum++)
{
! char old_file[MAXPGPATH];
! char new_file[MAXPGPATH];
!
! /* Changed tablespaces? Need a new directory scan? */
! if (strcmp(maps[mapnum].old_dir, old_dir) != 0)
! {
! if (numFiles > 0)
! {
! for (fileno = 0; fileno < numFiles; fileno++)
! pg_free(namelist[fileno]);
! pg_free(namelist);
! }
!
! snprintf(old_dir, sizeof(old_dir), "%s", maps[mapnum].old_dir);
! numFiles = load_directory(old_dir, &namelist);
! }
!
! /* Copying files might take some time, so give feedback. */
!
! snprintf(old_file, sizeof(old_file), "%s/%u", maps[mapnum].old_dir,
! maps[mapnum].old_relfilenode);
! snprintf(new_file, sizeof(new_file), "%s/%u", maps[mapnum].new_dir,
! maps[mapnum].new_relfilenode);
! pg_log(PG_REPORT, OVERWRITE_MESSAGE, old_file);
!
! /*
! * Copy/link the relation's primary file (segment 0 of main fork)
! * to the new cluster
! */
! unlink(new_file);
! transfer_relfile(pageConverter, old_file, new_file,
! maps[mapnum].nspname, maps[mapnum].relname);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
--- 130,148 ----
transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size)
{
int mapnum;
! bool vm_crashsafe_match = true;
! int segno;
!
/* Do not copy non-crashsafe vm files for binaries that assume crashsafety */
if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_CRASHSAFE_CAT_VER &&
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
! vm_crashsafe_match = false;
for (mapnum = 0; mapnum < size; mapnum++)
{
! /* transfer primary file */
! transfer_relfile(pageConverter, &maps[mapnum], "");
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
*************** transfer_single_new_db(pageCnvCtx *pageC
*** 187,253 ****
/*
* Copy/link any fsm and vm files, if they exist
*/
! snprintf(file_pattern, sizeof(file_pattern), "%u_",
! maps[mapnum].old_relfilenode);
!
! for (fileno = 0; fileno < numFiles; fileno++)
! {
! char *vm_offset = strstr(namelist[fileno], "_vm");
! bool is_vm_file = false;
!
! /* Is a visibility map file? (name ends with _vm) */
! if (vm_offset && strlen(vm_offset) == strlen("_vm"))
! is_vm_file = true;
!
! if (strncmp(namelist[fileno], file_pattern,
! strlen(file_pattern)) == 0 &&
! (!is_vm_file || !vm_crashsafe_change))
! {
! snprintf(old_file, sizeof(old_file), "%s/%s", maps[mapnum].old_dir,
! namelist[fileno]);
! snprintf(new_file, sizeof(new_file), "%s/%u%s", maps[mapnum].new_dir,
! maps[mapnum].new_relfilenode, strchr(namelist[fileno], '_'));
!
! unlink(new_file);
! transfer_relfile(pageConverter, old_file, new_file,
! maps[mapnum].nspname, maps[mapnum].relname);
! }
! }
}
/*
* Now copy/link any related segments as well. Remember, PG breaks
* large files into 1GB segments, the first segment has no extension,
* subsequent segments are named relfilenode.1, relfilenode.2,
! * relfilenode.3, ... 'fsm' and 'vm' files use underscores so are not
* copied.
*/
! snprintf(file_pattern, sizeof(file_pattern), "%u.",
! maps[mapnum].old_relfilenode);
!
! for (fileno = 0; fileno < numFiles; fileno++)
! {
! if (strncmp(namelist[fileno], file_pattern,
! strlen(file_pattern)) == 0)
! {
! snprintf(old_file, sizeof(old_file), "%s/%s", maps[mapnum].old_dir,
! namelist[fileno]);
! snprintf(new_file, sizeof(new_file), "%s/%u%s", maps[mapnum].new_dir,
! maps[mapnum].new_relfilenode, strchr(namelist[fileno], '.'));
! unlink(new_file);
! transfer_relfile(pageConverter, old_file, new_file,
! maps[mapnum].nspname, maps[mapnum].relname);
! }
}
}
-
- if (numFiles > 0)
- {
- for (fileno = 0; fileno < numFiles; fileno++)
- pg_free(namelist[fileno]);
- pg_free(namelist);
- }
}
--- 150,176 ----
/*
* Copy/link any fsm and vm files, if they exist
*/
! transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
! if (vm_crashsafe_match)
! transfer_relfile(pageConverter, &maps[mapnum], "_vm");
}
/*
* Now copy/link any related segments as well. Remember, PG breaks
* large files into 1GB segments, the first segment has no extension,
* subsequent segments are named relfilenode.1, relfilenode.2,
! * relfilenode.3.
* copied.
*/
! for (segno = 1;; segno++)
! {
! char suffix[65];
! snprintf(suffix, sizeof(suffix), ".%d", segno);
! if (transfer_relfile(pageConverter, &maps[mapnum], suffix) != 0)
! break;
}
}
}
*************** transfer_single_new_db(pageCnvCtx *pageC
*** 256,266 ****
*
* Copy or link file from old cluster to new one.
*/
! static void
! transfer_relfile(pageCnvCtx *pageConverter, const char *old_file,
! const char *new_file, const char *nspname, const char *relname)
{
const char *msg;
if ((user_opts.transfer_mode == TRANSFER_MODE_LINK) && (pageConverter != NULL))
pg_log(PG_FATAL, "This upgrade requires page-by-page conversion, "
--- 179,213 ----
*
* Copy or link file from old cluster to new one.
*/
! static int
! transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
! const char *suffix)
{
const char *msg;
+ char old_file[MAXPGPATH];
+ char new_file[MAXPGPATH];
+ int fd;
+
+ snprintf(old_file, sizeof(old_file), "%s/%u%s", map->old_dir,
+ map->old_relfilenode, suffix);
+ snprintf(new_file, sizeof(new_file), "%s/%u%s", map->new_dir,
+ map->new_relfilenode, suffix);
+
+ /* file does not exist? */
+ if ((fd = open(old_file, O_RDONLY)) == -1)
+ {
+ if (errno == ENOENT)
+ return -1;
+ else
+ pg_log(PG_FATAL, "non-existant file error while copying relation \"%s.%s\" (\"%s\" to \"%s\")\n",
+ map->nspname, map->relname, old_file, new_file);
+ }
+ close(fd);
+
+ unlink(new_file);
+
+ /* Copying files might take some time, so give feedback. */
+ pg_log(PG_REPORT, OVERWRITE_MESSAGE, old_file);
if ((user_opts.transfer_mode == TRANSFER_MODE_LINK) && (pageConverter != NULL))
pg_log(PG_FATAL, "This upgrade requires page-by-page conversion, "
*************** transfer_relfile(pageCnvCtx *pageConvert
*** 272,278 ****
if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
pg_log(PG_FATAL, "error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
! nspname, relname, old_file, new_file, msg);
}
else
{
--- 219,225 ----
if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
pg_log(PG_FATAL, "error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
! map->nspname, map->relname, old_file, new_file, msg);
}
else
{
*************** transfer_relfile(pageCnvCtx *pageConvert
*** 281,287 ****
if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
pg_log(PG_FATAL,
"error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
! nspname, relname, old_file, new_file, msg);
}
! return;
}
--- 228,234 ----
if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
pg_log(PG_FATAL,
"error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
! map->nspname, map->relname, old_file, new_file, msg);
}
! return 0;
}
On Mon, Nov 12, 2012 at 03:59:27PM -0500, Bruce Momjian wrote:
The second approach would be to simply try to copy the fsm, vm, and
extent files, and ignore any ENOENT errors. This allows code
simplification. The downside is that it doesn't pull all files with
matching prefixes --- it requires pg_upgrade to _know_ what suffixes
might exist in that directory. Second, it assumes there can be no
number gaps in the file extent numbering (is that safe?).
I need recommendations on which direction to pursue; this would only be
for 9.3.
I went with the second idea, patch attached. Here are the times:
---------- 9.2 ---------- ------------ 9.3 --------
-- normal -- -- bin-up -- -- normal -- -- bin-up -- pg_upgrade
dump rest dump rest dump rest dump rest git patch
1 0.12 0.06 0.12 0.06 0.11 0.07 0.11 0.07 11.11 11.02
1000 7.22 2.40 4.74 2.78 2.20 2.43 4.04 2.86 19.60 19.25
2000 5.67 5.10 8.82 5.57 4.50 4.97 8.07 5.69 30.55 26.67
4000 13.34 11.13 25.16 12.52 8.95 11.24 16.75 12.16 60.70 52.31
8000 29.12 25.98 59.60 28.08 16.68 24.02 30.63 27.08 123.05 102.78
16000 87.36 53.16 189.38 62.72 31.38 55.37 61.55 62.66 365.71 286.00
You can see a significant speedup with those loops removed. The 16k
case is improved, but still not linear. The 16k dump/restore scale
looks fine, so it must be something in pg_upgrade, or in the kernel.
It is possible that the poor 16k pg_upgrade value is caused by the poor
9.2 binary-upgrade number (189.38). Perhaps I need to hack up
pg_upgrade to allow a 9.3 to 9.3 upgrade to test this.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Bruce Momjian escribió:
--- 17,24 ----
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size);
! static int transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
! const char *suffix);
Uh, does this code assume that forks other than the main one are not
split in segments? I think that's a bug, is it not?
--
Bruce Momjian escribió:
It is possible that the poor 16k pg_upgrade value is caused by the poor
9.2 binary-upgrade number (189.38). Perhaps I need to hack up
pg_upgrade to allow a 9.3 to 9.3 upgrade to test this.
Hmm? This already works, since "make check" uses it, right?
--
On Mon, Nov 12, 2012 at 06:14:59PM -0300, Alvaro Herrera wrote:
Bruce Momjian escribió:
--- 17,24 ----
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size);
! static int transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
! const char *suffix);
Uh, does this code assume that forks other than the main one are not
split in segments? I think that's a bug, is it not?
Oh, yeah, I must have fixed this long ago. It only fails if you use
tablespaces:
if (os_info.num_tablespaces > 0 &&
strcmp(old_cluster.tablespace_suffix, new_cluster.tablespace_suffix) == 0)
pg_log(PG_FATAL,
"Cannot upgrade to/from the same system catalog version when\n"
"using tablespaces.\n");
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Mon, Nov 12, 2012 at 12:09:08PM -0500, Bruce Momjian wrote:
The second approach would be to simply try to copy the fsm, vm, and
extent files, and ignore any ENOENT errors. This allows code
simplification. The downside is that it doesn't pull all files with
matching prefixes --- it requires pg_upgrade to _know_ what suffixes
might exist in that directory. Second, it assumes there can be no
number gaps in the file extent numbering (is that safe?).
Seems our code does the same kind of segment number looping I was
suggesting for pg_upgrade, so I think I am safe:
/*
* Note that because we loop until getting ENOENT, we will correctly
* remove all inactive segments as well as active ones.
*/
for (segno = 1;; segno++)
{
sprintf(segpath, "%s.%u", path, segno);
if (unlink(segpath) < 0)
{
/* ENOENT is expected after the last segment... */
if (errno != ENOENT)
ereport(WARNING,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", segpath)));
break;
}
}
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Fri, Nov 9, 2012 at 10:50 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
Are sure the server you are dumping out of is head?
I experimented a bit with dumping/restoring 16000 tables matching
Bruce's test case (ie, one serial column apiece). The pg_dump profile
seems fairly flat, without any easy optimization targets. But
restoring the dump script shows a rather interesting backend profile:samples % image name symbol name
30861 39.6289 postgres AtEOXact_RelationCache
9911 12.7268 postgres hash_seq_search...
There are at least three ways we could whack that mole:
* Run the psql script in --single-transaction mode, as I was mumbling
about the other day. If we were doing AtEOXact_RelationCache only once,
rather than once per CREATE TABLE statement, it wouldn't be a problem.
Easy but has only a narrow scope of applicability.
That is effective when loading into 9.3 (assuming you make
max_locks_per_transaction large enough). But when loading into <9.3,
using --single-transaction will trigger the quadratic behavior in the
resource owner/lock table and make things worse rather than better.
Using --single-transaction gets around the AtEOXact_RelationCache
quadratic, but it activates another quadratic behavior, this one in
"get_tabstat_entry". That is a good trade-off because that one has a
lower constant, but it is still going to bite.
Cheers,
Jeff
On 12 November 2012 16:51, Robert Haas <robertmhaas@gmail.com> wrote:
Although there may be some workloads that access very large numbers of
tables repeatedly, I bet that's not typical.
Transactions with large numbers of DDL statements are typical at
upgrade (application or database release level) and the execution time
of those is critical to availability.
I'm guessing you mean large numbers of tables and accessing each one
multiple times?
Rather, I bet that a
session which accesses 10,000 tables is most likely to access them
just once each - and right now we don't handle that case very well;
this is not the first complaint about big relcaches causing problems.
pg_restore frequently accesses tables more than once as it runs, but
not more than a dozen times each, counting all types of DDL.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Nov 12, 2012 at 5:17 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 12 November 2012 16:51, Robert Haas <robertmhaas@gmail.com> wrote:
Although there may be some workloads that access very large numbers of
tables repeatedly, I bet that's not typical.
Transactions with large numbers of DDL statements are typical at
upgrade (application or database release level) and the execution time
of those is critical to availability.
I'm guessing you mean large numbers of tables and accessing each one
multiple times?
Yes, that is what I meant.
Rather, I bet that a
session which accesses 10,000 tables is most likely to access them
just once each - and right now we don't handle that case very well;
this is not the first complaint about big relcaches causing problems.
pg_restore frequently accesses tables more than once as it runs, but
not more than a dozen times each, counting all types of DDL.
Hmm... yeah. Some of those accesses are probably one right after
another so any cache-flushing behavior would be fine; but index
creations for example might happen quite a bit later in the file,
IIRC.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Nov 12, 2012 at 10:59 PM, Bruce Momjian <bruce@momjian.us> wrote:
You can see a significant speedup with those loops removed. The 16k
case is improved, but still not linear. The 16k dump/restore scale
looks fine, so it must be something in pg_upgrade, or in the kernel.
I can confirm the speedup. Profiling results for 9.3 to 9.3 upgrade
for 8k and 64k tables are attached. pg_upgrade itself is now taking
negligible time.
The 64k profile shows the AtEOXact_RelationCache scaling problem. For
the 8k profile nothing really pops out as a clear bottleneck. CPU time
distributes 83.1% to postgres, 4.9% to pg_dump, 7.4% to psql and 0.7%
to pg_upgrade.
Postgres time itself breaks down with 10% for shutdown checkpoint and
90% for regular running, consisting of 16% parsing, 13% analyze, 20%
plan, 30% execute, 11% commit (AtEOXact_RelationCache) and 6% network.
It looks to me that most benefit could be had from introducing more
parallelism. Are there any large roadblocks to pipelining the dump and
restore to have them happen in parallel?
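The simplest form of that pipelining, overlapping the dump with the restore instead of materializing the script first, might look like this (a sketch only; the port numbers are hypothetical, and pg_upgrade does not currently work this way):

```shell
# Dump the old cluster's schema and feed it straight into the new
# cluster, so producer and consumer run concurrently instead of
# serially.
pg_dumpall --schema-only --port=5432 | psql --port=5433 -q postgres
```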
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On Tue, Nov 13, 2012 at 05:44:54AM +0200, Ants Aasma wrote:
On Mon, Nov 12, 2012 at 10:59 PM, Bruce Momjian <bruce@momjian.us> wrote:
You can see a significant speedup with those loops removed. The 16k
case is improved, but still not linear. The 16k dump/restore scale
looks fine, so it must be something in pg_upgrade, or in the kernel.
I can confirm the speedup. Profiling results for 9.3 to 9.3 upgrade
for 8k and 64k tables are attached. pg_upgrade itself is now taking
negligible time.
I generated these timings from the attached test script.
-------------------------- 9.3 ------------------------
---- normal ---- -- binary_upgrade -- -- pg_upgrade -
- dmp - - res - - dmp - - res - git patch
1 0.12 0.07 0.13 0.07 11.06 11.02
1000 2.20 2.46 3.57 2.82 19.15 18.61
2000 4.51 5.01 8.22 5.80 29.12 26.89
4000 8.97 10.88 14.76 12.43 45.87 43.08
8000 15.30 24.72 30.57 27.10 100.31 79.75
16000 36.14 54.88 62.27 61.69 248.03 167.94
32000 55.29 162.20 115.16 179.15 695.05 376.84
64000 149.86 716.46 265.77 724.32 2323.73 1122.38
You can see the speedup of the patch, particularly for a greater number
of tables, e.g. 2x faster for 64k tables.
The 64k profile shows the AtEOXact_RelationCache scaling problem. For
the 8k profile nothing really pops out as a clear bottleneck. CPU time
distributes 83.1% to postgres, 4.9% to pg_dump, 7.4% to psql and 0.7%
to pg_upgrade.
At 64k I see pg_upgrade taking 12% of the total duration, if I subtract
out the dump/restore times.
I am attaching an updated pg_upgrade patch, which I believe is ready for
application for 9.3.
Postgres time itself breaks down with 10% for shutdown checkpoint and
90% for regular running, consisting of 16% parsing, 13% analyze, 20%
plan, 30% execute, 11% commit (AtEOXact_RelationCache) and 6% network.
That SVG graph was quite impressive.
It looks to me that most benefit could be had from introducing more
parallelism. Are there any large roadblocks to pipelining the dump and
restore to have them happen in parallel?
I talked to Andrew Dunstan about parallelization in pg_restore. First,
we currently use pg_dumpall, which isn't in the custom format required
for parallel restore. Even if we changed to custom format, CREATE TABLE
isn't done in parallel; only index creation, check constraints, trigger
creation, etc. are. Not sure if it is worth pursuing this just for
pg_upgrade.
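For reference, the custom-format path under discussion would look roughly like this per database (a sketch with hypothetical names; pg_restore's --jobs parallelism covers post-data steps such as index and constraint creation, not CREATE TABLE itself):

```shell
# Per-database custom-format dump of the schema, then a parallel
# restore into the new cluster.
pg_dump --format=custom --schema-only -f olddb.dump olddb
pg_restore --jobs=4 -d newdb olddb.dump
```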
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Attachments:
pg_upgrade.diff (text/x-diff; charset=us-ascii)
diff --git a/contrib/pg_upgrade/file.c b/contrib/pg_upgrade/file.c
new file mode 100644
index a5d92c6..d8cd8f5
*** a/contrib/pg_upgrade/file.c
--- b/contrib/pg_upgrade/file.c
*************** copy_file(const char *srcfile, const cha
*** 221,281 ****
#endif
- /*
- * load_directory()
- *
- * Read all the file names in the specified directory, and return them as
- * an array of "char *" pointers. The array address is returned in
- * *namelist, and the function result is the count of file names.
- *
- * To free the result data, free each (char *) array member, then free the
- * namelist array itself.
- */
- int
- load_directory(const char *dirname, char ***namelist)
- {
- DIR *dirdesc;
- struct dirent *direntry;
- int count = 0;
- int allocsize = 64; /* initial array size */
-
- *namelist = (char **) pg_malloc(allocsize * sizeof(char *));
-
- if ((dirdesc = opendir(dirname)) == NULL)
- pg_log(PG_FATAL, "could not open directory \"%s\": %s\n",
- dirname, getErrorText(errno));
-
- while (errno = 0, (direntry = readdir(dirdesc)) != NULL)
- {
- if (count >= allocsize)
- {
- allocsize *= 2;
- *namelist = (char **)
- pg_realloc(*namelist, allocsize * sizeof(char *));
- }
-
- (*namelist)[count++] = pg_strdup(direntry->d_name);
- }
-
- #ifdef WIN32
- /*
- * This fix is in mingw cvs (runtime/mingwex/dirent.c rev 1.4), but not in
- * released version
- */
- if (GetLastError() == ERROR_NO_MORE_FILES)
- errno = 0;
- #endif
-
- if (errno)
- pg_log(PG_FATAL, "could not read directory \"%s\": %s\n",
- dirname, getErrorText(errno));
-
- closedir(dirdesc);
-
- return count;
- }
-
-
void
check_hard_link(void)
{
--- 221,226 ----
diff --git a/contrib/pg_upgrade/pg_upgrade.h b/contrib/pg_upgrade/pg_upgrade.h
new file mode 100644
index 3058343..f35ce75
*** a/contrib/pg_upgrade/pg_upgrade.h
--- b/contrib/pg_upgrade/pg_upgrade.h
***************
*** 7,13 ****
#include <unistd.h>
#include <assert.h>
- #include <dirent.h>
#include <sys/stat.h>
#include <sys/time.h>
--- 7,12 ----
*************** const char *setupPageConverter(pageCnvCt
*** 366,372 ****
typedef void *pageCnvCtx;
#endif
- int load_directory(const char *dirname, char ***namelist);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
--- 365,370 ----
diff --git a/contrib/pg_upgrade/relfilenode.c b/contrib/pg_upgrade/relfilenode.c
new file mode 100644
index 33a867f..d763ba7
*** a/contrib/pg_upgrade/relfilenode.c
--- b/contrib/pg_upgrade/relfilenode.c
***************
*** 17,25 ****
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size);
! static void transfer_relfile(pageCnvCtx *pageConverter,
! const char *fromfile, const char *tofile,
! const char *nspname, const char *relname);
/*
--- 17,24 ----
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size);
! static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
! const char *suffix);
/*
*************** static void
*** 131,185 ****
transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size)
{
- char old_dir[MAXPGPATH];
- char file_pattern[MAXPGPATH];
- char **namelist = NULL;
- int numFiles = 0;
int mapnum;
! int fileno;
! bool vm_crashsafe_change = false;
!
! old_dir[0] = '\0';
!
! /* Do not copy non-crashsafe vm files for binaries that assume crashsafety */
if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_CRASHSAFE_CAT_VER &&
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
! vm_crashsafe_change = true;
for (mapnum = 0; mapnum < size; mapnum++)
{
! char old_file[MAXPGPATH];
! char new_file[MAXPGPATH];
!
! /* Changed tablespaces? Need a new directory scan? */
! if (strcmp(maps[mapnum].old_dir, old_dir) != 0)
! {
! if (numFiles > 0)
! {
! for (fileno = 0; fileno < numFiles; fileno++)
! pg_free(namelist[fileno]);
! pg_free(namelist);
! }
!
! snprintf(old_dir, sizeof(old_dir), "%s", maps[mapnum].old_dir);
! numFiles = load_directory(old_dir, &namelist);
! }
!
! /* Copying files might take some time, so give feedback. */
!
! snprintf(old_file, sizeof(old_file), "%s/%u", maps[mapnum].old_dir,
! maps[mapnum].old_relfilenode);
! snprintf(new_file, sizeof(new_file), "%s/%u", maps[mapnum].new_dir,
! maps[mapnum].new_relfilenode);
! pg_log(PG_REPORT, OVERWRITE_MESSAGE, old_file);
!
! /*
! * Copy/link the relation's primary file (segment 0 of main fork)
! * to the new cluster
! */
! unlink(new_file);
! transfer_relfile(pageConverter, old_file, new_file,
! maps[mapnum].nspname, maps[mapnum].relname);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
--- 130,150 ----
transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size)
{
int mapnum;
! bool vm_crashsafe_match = true;
!
! /*
! * Do the old and new cluster disagree on the crash-safetiness of the vm
! * files? If so, do not copy them.
! */
if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_CRASHSAFE_CAT_VER &&
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
! vm_crashsafe_match = false;
for (mapnum = 0; mapnum < size; mapnum++)
{
! /* transfer primary file */
! transfer_relfile(pageConverter, &maps[mapnum], "");
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
*************** transfer_single_new_db(pageCnvCtx *pageC
*** 187,253 ****
/*
* Copy/link any fsm and vm files, if they exist
*/
! snprintf(file_pattern, sizeof(file_pattern), "%u_",
! maps[mapnum].old_relfilenode);
!
! for (fileno = 0; fileno < numFiles; fileno++)
! {
! char *vm_offset = strstr(namelist[fileno], "_vm");
! bool is_vm_file = false;
!
! /* Is a visibility map file? (name ends with _vm) */
! if (vm_offset && strlen(vm_offset) == strlen("_vm"))
! is_vm_file = true;
!
! if (strncmp(namelist[fileno], file_pattern,
! strlen(file_pattern)) == 0 &&
! (!is_vm_file || !vm_crashsafe_change))
! {
! snprintf(old_file, sizeof(old_file), "%s/%s", maps[mapnum].old_dir,
! namelist[fileno]);
! snprintf(new_file, sizeof(new_file), "%s/%u%s", maps[mapnum].new_dir,
! maps[mapnum].new_relfilenode, strchr(namelist[fileno], '_'));
!
! unlink(new_file);
! transfer_relfile(pageConverter, old_file, new_file,
! maps[mapnum].nspname, maps[mapnum].relname);
! }
! }
! }
!
! /*
! * Now copy/link any related segments as well. Remember, PG breaks
! * large files into 1GB segments, the first segment has no extension,
! * subsequent segments are named relfilenode.1, relfilenode.2,
! * relfilenode.3, ... 'fsm' and 'vm' files use underscores so are not
! * copied.
! */
! snprintf(file_pattern, sizeof(file_pattern), "%u.",
! maps[mapnum].old_relfilenode);
!
! for (fileno = 0; fileno < numFiles; fileno++)
! {
! if (strncmp(namelist[fileno], file_pattern,
! strlen(file_pattern)) == 0)
! {
! snprintf(old_file, sizeof(old_file), "%s/%s", maps[mapnum].old_dir,
! namelist[fileno]);
! snprintf(new_file, sizeof(new_file), "%s/%u%s", maps[mapnum].new_dir,
! maps[mapnum].new_relfilenode, strchr(namelist[fileno], '.'));
!
! unlink(new_file);
! transfer_relfile(pageConverter, old_file, new_file,
! maps[mapnum].nspname, maps[mapnum].relname);
! }
}
}
-
- if (numFiles > 0)
- {
- for (fileno = 0; fileno < numFiles; fileno++)
- pg_free(namelist[fileno]);
- pg_free(namelist);
- }
}
--- 152,162 ----
/*
* Copy/link any fsm and vm files, if they exist
*/
! transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
! if (vm_crashsafe_match)
! transfer_relfile(pageConverter, &maps[mapnum], "_vm");
}
}
}
*************** transfer_single_new_db(pageCnvCtx *pageC
*** 257,287 ****
* Copy or link file from old cluster to new one.
*/
static void
! transfer_relfile(pageCnvCtx *pageConverter, const char *old_file,
! const char *new_file, const char *nspname, const char *relname)
{
const char *msg;
!
! if ((user_opts.transfer_mode == TRANSFER_MODE_LINK) && (pageConverter != NULL))
! pg_log(PG_FATAL, "This upgrade requires page-by-page conversion, "
! "you must use copy mode instead of link mode.\n");
!
! if (user_opts.transfer_mode == TRANSFER_MODE_COPY)
{
! pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
! if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
! pg_log(PG_FATAL, "error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
! nspname, relname, old_file, new_file, msg);
! }
! else
! {
! pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
- pg_log(PG_FATAL,
- "error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
- nspname, relname, old_file, new_file, msg);
- }
return;
}
--- 166,243 ----
* Copy or link file from old cluster to new one.
*/
static void
! transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
! const char *type_suffix)
{
const char *msg;
! char old_file[MAXPGPATH];
! char new_file[MAXPGPATH];
! int fd;
! int segno;
! char extent_suffix[65];
!
! /*
! * Now copy/link any related segments as well. Remember, PG breaks
! * large files into 1GB segments, the first segment has no extension,
! * subsequent segments are named relfilenode.1, relfilenode.2,
! * relfilenode.3, and so on.  Each such segment is likewise
! * copied.
! */
! for (segno = 0;; segno++)
{
! if (segno == 0)
! extent_suffix[0] = '\0';
! else
! snprintf(extent_suffix, sizeof(extent_suffix), ".%d", segno);
! snprintf(old_file, sizeof(old_file), "%s/%u%s%s", map->old_dir,
! map->old_relfilenode, type_suffix, extent_suffix);
! snprintf(new_file, sizeof(new_file), "%s/%u%s%s", map->new_dir,
! map->new_relfilenode, type_suffix, extent_suffix);
!
! /* Is it an extent, fsm, or vm file? */
! if (type_suffix[0] != '\0' || segno != 0)
! {
! /* Did file open fail? */
! if ((fd = open(old_file, O_RDONLY)) == -1)
! {
! /* File does not exist? That's OK, just return */
! if (errno == ENOENT)
! return;
! else
! pg_log(PG_FATAL, "non-existent file error while copying relation \"%s.%s\" (\"%s\" to \"%s\")\n",
! map->nspname, map->relname, old_file, new_file);
! }
! close(fd);
! }
!
! unlink(new_file);
!
! /* Copying files might take some time, so give feedback. */
! pg_log(PG_REPORT, OVERWRITE_MESSAGE, old_file);
!
! if ((user_opts.transfer_mode == TRANSFER_MODE_LINK) && (pageConverter != NULL))
! pg_log(PG_FATAL, "This upgrade requires page-by-page conversion, "
! "you must use copy mode instead of link mode.\n");
!
! if (user_opts.transfer_mode == TRANSFER_MODE_COPY)
! {
! pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
!
! if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
! pg_log(PG_FATAL, "error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
! map->nspname, map->relname, old_file, new_file, msg);
! }
! else
! {
! pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
!
! if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
! pg_log(PG_FATAL,
! "error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
! map->nspname, map->relname, old_file, new_file, msg);
! }
! }
return;
}
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
new file mode 100644
index f40a1fe..5c3afad
*** a/doc/src/sgml/Makefile
--- b/doc/src/sgml/Makefile
*************** postgres.xml: $(srcdir)/postgres.sgml $(
*** 255,266 ****
rm postgres.xmltmp
# ' hello Emacs
! xslthtml: xslthtml-stamp
!
! xslthtml-stamp: stylesheet.xsl postgres.xml
$(XSLTPROC) $(XSLTPROCFLAGS) $(XSLTPROC_HTML_FLAGS) $^
- cp $(srcdir)/stylesheet.css html/
- touch $@
htmlhelp: stylesheet-hh.xsl postgres.xml
$(XSLTPROC) $(XSLTPROCFLAGS) $^
--- 255,262 ----
rm postgres.xmltmp
# ' hello Emacs
! xslthtml: stylesheet.xsl postgres.xml
$(XSLTPROC) $(XSLTPROCFLAGS) $(XSLTPROC_HTML_FLAGS) $^
htmlhelp: stylesheet-hh.xsl postgres.xml
$(XSLTPROC) $(XSLTPROCFLAGS) $^
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
new file mode 100644
index 8872920..445ca40
*** a/doc/src/sgml/ref/create_table.sgml
--- b/doc/src/sgml/ref/create_table.sgml
*************** CREATE TABLE employees OF employee_type
*** 1453,1459 ****
<simplelist type="inline">
<member><xref linkend="sql-altertable"></member>
<member><xref linkend="sql-droptable"></member>
- <member><xref linkend="sql-createtableas"></member>
<member><xref linkend="sql-createtablespace"></member>
<member><xref linkend="sql-createtype"></member>
</simplelist>
--- 1453,1458 ----
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
new file mode 100644
index 55df02a..b9bfde2
*** a/src/backend/access/gin/ginfast.c
--- b/src/backend/access/gin/ginfast.c
*************** ginHeapTupleFastInsert(GinState *ginstat
*** 290,296 ****
if (metadata->head == InvalidBlockNumber)
{
/*
! * Main list is empty, so just insert sublist as main list
*/
START_CRIT_SECTION();
--- 290,296 ----
if (metadata->head == InvalidBlockNumber)
{
/*
! * Main list is empty, so just copy sublist into main list
*/
START_CRIT_SECTION();
*************** ginHeapTupleFastInsert(GinState *ginstat
*** 313,326 ****
LockBuffer(buffer, GIN_EXCLUSIVE);
page = BufferGetPage(buffer);
- rdata[0].next = rdata + 1;
-
- rdata[1].buffer = buffer;
- rdata[1].buffer_std = true;
- rdata[1].data = NULL;
- rdata[1].len = 0;
- rdata[1].next = NULL;
-
Assert(GinPageGetOpaque(page)->rightlink == InvalidBlockNumber);
START_CRIT_SECTION();
--- 313,318 ----
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
new file mode 100644
index 4536c9c..250619c
*** a/src/backend/access/gin/ginxlog.c
--- b/src/backend/access/gin/ginxlog.c
*************** ginRedoCreateIndex(XLogRecPtr lsn, XLogR
*** 77,85 ****
MetaBuffer;
Page page;
- /* Backup blocks are not used in create_index records */
- Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
-
MetaBuffer = XLogReadBuffer(*node, GIN_METAPAGE_BLKNO, true);
Assert(BufferIsValid(MetaBuffer));
page = (Page) BufferGetPage(MetaBuffer);
--- 77,82 ----
*************** ginRedoCreatePTree(XLogRecPtr lsn, XLogR
*** 112,120 ****
Buffer buffer;
Page page;
- /* Backup blocks are not used in create_ptree records */
- Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
-
buffer = XLogReadBuffer(data->node, data->blkno, true);
Assert(BufferIsValid(buffer));
page = (Page) BufferGetPage(buffer);
--- 109,114 ----
*************** ginRedoInsert(XLogRecPtr lsn, XLogRecord
*** 165,176 ****
}
}
! /* If we have a full-page image, restore it and we're done */
! if (record->xl_info & XLR_BKP_BLOCK(0))
! {
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
return;
- }
buffer = XLogReadBuffer(data->node, data->blkno, false);
if (!BufferIsValid(buffer))
--- 159,167 ----
}
}
! /* nothing else to do if page was backed up */
! if (record->xl_info & XLR_BKP_BLOCK_1)
return;
buffer = XLogReadBuffer(data->node, data->blkno, false);
if (!BufferIsValid(buffer))
*************** ginRedoSplit(XLogRecPtr lsn, XLogRecord
*** 265,273 ****
if (data->isData)
flags |= GIN_DATA;
- /* Backup blocks are not used in split records */
- Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
-
lbuffer = XLogReadBuffer(data->node, data->lblkno, true);
Assert(BufferIsValid(lbuffer));
lpage = (Page) BufferGetPage(lbuffer);
--- 256,261 ----
*************** ginRedoVacuumPage(XLogRecPtr lsn, XLogRe
*** 381,392 ****
Buffer buffer;
Page page;
! /* If we have a full-page image, restore it and we're done */
! if (record->xl_info & XLR_BKP_BLOCK(0))
! {
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
return;
- }
buffer = XLogReadBuffer(data->node, data->blkno, false);
if (!BufferIsValid(buffer))
--- 369,377 ----
Buffer buffer;
Page page;
! /* nothing to do if page was backed up (and no info to do it with) */
! if (record->xl_info & XLR_BKP_BLOCK_1)
return;
buffer = XLogReadBuffer(data->node, data->blkno, false);
if (!BufferIsValid(buffer))
*************** static void
*** 435,472 ****
ginRedoDeletePage(XLogRecPtr lsn, XLogRecord *record)
{
ginxlogDeletePage *data = (ginxlogDeletePage *) XLogRecGetData(record);
! Buffer dbuffer;
! Buffer pbuffer;
! Buffer lbuffer;
Page page;
! if (record->xl_info & XLR_BKP_BLOCK(0))
! dbuffer = RestoreBackupBlock(lsn, record, 0, false, true);
! else
{
! dbuffer = XLogReadBuffer(data->node, data->blkno, false);
! if (BufferIsValid(dbuffer))
{
! page = BufferGetPage(dbuffer);
if (!XLByteLE(lsn, PageGetLSN(page)))
{
Assert(GinPageIsData(page));
GinPageGetOpaque(page)->flags = GIN_DELETED;
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
! MarkBufferDirty(dbuffer);
}
}
}
! if (record->xl_info & XLR_BKP_BLOCK(1))
! pbuffer = RestoreBackupBlock(lsn, record, 1, false, true);
! else
{
! pbuffer = XLogReadBuffer(data->node, data->parentBlkno, false);
! if (BufferIsValid(pbuffer))
{
! page = BufferGetPage(pbuffer);
if (!XLByteLE(lsn, PageGetLSN(page)))
{
Assert(GinPageIsData(page));
--- 420,452 ----
ginRedoDeletePage(XLogRecPtr lsn, XLogRecord *record)
{
ginxlogDeletePage *data = (ginxlogDeletePage *) XLogRecGetData(record);
! Buffer buffer;
Page page;
! if (!(record->xl_info & XLR_BKP_BLOCK_1))
{
! buffer = XLogReadBuffer(data->node, data->blkno, false);
! if (BufferIsValid(buffer))
{
! page = BufferGetPage(buffer);
if (!XLByteLE(lsn, PageGetLSN(page)))
{
Assert(GinPageIsData(page));
GinPageGetOpaque(page)->flags = GIN_DELETED;
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
! MarkBufferDirty(buffer);
}
+ UnlockReleaseBuffer(buffer);
}
}
! if (!(record->xl_info & XLR_BKP_BLOCK_2))
{
! buffer = XLogReadBuffer(data->node, data->parentBlkno, false);
! if (BufferIsValid(buffer))
{
! page = BufferGetPage(buffer);
if (!XLByteLE(lsn, PageGetLSN(page)))
{
Assert(GinPageIsData(page));
*************** ginRedoDeletePage(XLogRecPtr lsn, XLogRe
*** 474,508 ****
GinPageDeletePostingItem(page, data->parentOffset);
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
! MarkBufferDirty(pbuffer);
}
}
}
! if (record->xl_info & XLR_BKP_BLOCK(2))
! (void) RestoreBackupBlock(lsn, record, 2, false, false);
! else if (data->leftBlkno != InvalidBlockNumber)
{
! lbuffer = XLogReadBuffer(data->node, data->leftBlkno, false);
! if (BufferIsValid(lbuffer))
{
! page = BufferGetPage(lbuffer);
if (!XLByteLE(lsn, PageGetLSN(page)))
{
Assert(GinPageIsData(page));
GinPageGetOpaque(page)->rightlink = data->rightLink;
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
! MarkBufferDirty(lbuffer);
}
! UnlockReleaseBuffer(lbuffer);
}
}
-
- if (BufferIsValid(pbuffer))
- UnlockReleaseBuffer(pbuffer);
- if (BufferIsValid(dbuffer))
- UnlockReleaseBuffer(dbuffer);
}
static void
--- 454,482 ----
GinPageDeletePostingItem(page, data->parentOffset);
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
! MarkBufferDirty(buffer);
}
+ UnlockReleaseBuffer(buffer);
}
}
! if (!(record->xl_info & XLR_BKP_BLOCK_3) && data->leftBlkno != InvalidBlockNumber)
{
! buffer = XLogReadBuffer(data->node, data->leftBlkno, false);
! if (BufferIsValid(buffer))
{
! page = BufferGetPage(buffer);
if (!XLByteLE(lsn, PageGetLSN(page)))
{
Assert(GinPageIsData(page));
GinPageGetOpaque(page)->rightlink = data->rightLink;
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
! MarkBufferDirty(buffer);
}
! UnlockReleaseBuffer(buffer);
}
}
}
static void
*************** ginRedoUpdateMetapage(XLogRecPtr lsn, XL
*** 531,539 ****
/*
* insert into tail page
*/
! if (record->xl_info & XLR_BKP_BLOCK(0))
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
! else
{
buffer = XLogReadBuffer(data->node, data->metadata.tail, false);
if (BufferIsValid(buffer))
--- 505,511 ----
/*
* insert into tail page
*/
! if (!(record->xl_info & XLR_BKP_BLOCK_1))
{
buffer = XLogReadBuffer(data->node, data->metadata.tail, false);
if (BufferIsValid(buffer))
*************** ginRedoUpdateMetapage(XLogRecPtr lsn, XL
*** 581,605 ****
/*
* New tail
*/
! if (record->xl_info & XLR_BKP_BLOCK(0))
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
! else
{
! buffer = XLogReadBuffer(data->node, data->prevTail, false);
! if (BufferIsValid(buffer))
! {
! Page page = BufferGetPage(buffer);
! if (!XLByteLE(lsn, PageGetLSN(page)))
! {
! GinPageGetOpaque(page)->rightlink = data->newRightlink;
! PageSetLSN(page, lsn);
! PageSetTLI(page, ThisTimeLineID);
! MarkBufferDirty(buffer);
! }
! UnlockReleaseBuffer(buffer);
}
}
}
--- 553,572 ----
/*
* New tail
*/
! buffer = XLogReadBuffer(data->node, data->prevTail, false);
! if (BufferIsValid(buffer))
{
! Page page = BufferGetPage(buffer);
! if (!XLByteLE(lsn, PageGetLSN(page)))
! {
! GinPageGetOpaque(page)->rightlink = data->newRightlink;
! PageSetLSN(page, lsn);
! PageSetTLI(page, ThisTimeLineID);
! MarkBufferDirty(buffer);
}
+ UnlockReleaseBuffer(buffer);
}
}
*************** ginRedoInsertListPage(XLogRecPtr lsn, XL
*** 618,629 ****
tupsize;
IndexTuple tuples = (IndexTuple) (XLogRecGetData(record) + sizeof(ginxlogInsertListPage));
! /* If we have a full-page image, restore it and we're done */
! if (record->xl_info & XLR_BKP_BLOCK(0))
! {
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
return;
- }
buffer = XLogReadBuffer(data->node, data->blkno, true);
Assert(BufferIsValid(buffer));
--- 585,592 ----
tupsize;
IndexTuple tuples = (IndexTuple) (XLogRecGetData(record) + sizeof(ginxlogInsertListPage));
! if (record->xl_info & XLR_BKP_BLOCK_1)
return;
buffer = XLogReadBuffer(data->node, data->blkno, true);
Assert(BufferIsValid(buffer));
*************** ginRedoDeleteListPages(XLogRecPtr lsn, X
*** 669,677 ****
Page metapage;
int i;
- /* Backup blocks are not used in delete_listpage records */
- Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
-
metabuffer = XLogReadBuffer(data->node, GIN_METAPAGE_BLKNO, false);
if (!BufferIsValid(metabuffer))
return; /* assume index was deleted, nothing to do */
--- 632,637 ----
*************** ginRedoDeleteListPages(XLogRecPtr lsn, X
*** 685,700 ****
MarkBufferDirty(metabuffer);
}
- /*
- * In normal operation, shiftList() takes exclusive lock on all the
- * pages-to-be-deleted simultaneously. During replay, however, it should
- * be all right to lock them one at a time. This is dependent on the fact
- * that we are deleting pages from the head of the list, and that readers
- * share-lock the next page before releasing the one they are on. So we
- * cannot get past a reader that is on, or due to visit, any page we are
- * going to delete. New incoming readers will block behind our metapage
- * lock and then see a fully updated page list.
- */
for (i = 0; i < data->ndeleted; i++)
{
Buffer buffer = XLogReadBuffer(data->node, data->toDelete[i], false);
--- 645,650 ----
*************** gin_redo(XLogRecPtr lsn, XLogRecord *rec
*** 728,733 ****
--- 678,684 ----
* implement a similar optimization as we have in b-tree, and remove
* killed tuples outside VACUUM, we'll need to handle that here.
*/
+ RestoreBkpBlocks(lsn, record, false);
topCtx = MemoryContextSwitchTo(opCtx);
switch (info)
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
new file mode 100644
index 4440499..76029d9
*** a/src/backend/access/gist/gistxlog.c
--- b/src/backend/access/gist/gistxlog.c
*************** typedef struct
*** 32,79 ****
static MemoryContext opCtx; /* working memory for operations */
/*
! * Replay the clearing of F_FOLLOW_RIGHT flag on a child page.
! *
! * Even if the WAL record includes a full-page image, we have to update the
! * follow-right flag, because that change is not included in the full-page
! * image. To be sure that the intermediate state with the wrong flag value is
! * not visible to concurrent Hot Standby queries, this function handles
! * restoring the full-page image as well as updating the flag. (Note that
! * we never need to do anything else to the child page in the current WAL
! * action.)
*/
static void
! gistRedoClearFollowRight(XLogRecPtr lsn, XLogRecord *record, int block_index,
! RelFileNode node, BlockNumber childblkno)
{
Buffer buffer;
- Page page;
! if (record->xl_info & XLR_BKP_BLOCK(block_index))
! buffer = RestoreBackupBlock(lsn, record, block_index, false, true);
! else
{
! buffer = XLogReadBuffer(node, childblkno, false);
! if (!BufferIsValid(buffer))
! return; /* page was deleted, nothing to do */
! }
! page = (Page) BufferGetPage(buffer);
! /*
! * Note that we still update the page even if page LSN is equal to the LSN
! * of this record, because the updated NSN is not included in the full
! * page image.
! */
! if (!XLByteLT(lsn, PageGetLSN(page)))
! {
! GistPageGetOpaque(page)->nsn = lsn;
! GistClearFollowRight(page);
! PageSetLSN(page, lsn);
! PageSetTLI(page, ThisTimeLineID);
! MarkBufferDirty(buffer);
}
- UnlockReleaseBuffer(buffer);
}
/*
--- 32,66 ----
static MemoryContext opCtx; /* working memory for operations */
/*
! * Replay the clearing of F_FOLLOW_RIGHT flag.
*/
static void
! gistRedoClearFollowRight(RelFileNode node, XLogRecPtr lsn,
! BlockNumber leftblkno)
{
Buffer buffer;
! buffer = XLogReadBuffer(node, leftblkno, false);
! if (BufferIsValid(buffer))
{
! Page page = (Page) BufferGetPage(buffer);
! /*
! * Note that we still update the page even if page LSN is equal to the
! * LSN of this record, because the updated NSN is not included in the
! * full page image.
! */
! if (!XLByteLT(lsn, PageGetLSN(page)))
! {
! GistPageGetOpaque(page)->nsn = lsn;
! GistClearFollowRight(page);
! PageSetLSN(page, lsn);
! PageSetTLI(page, ThisTimeLineID);
! MarkBufferDirty(buffer);
! }
! UnlockReleaseBuffer(buffer);
}
}
/*
*************** gistRedoPageUpdateRecord(XLogRecPtr lsn,
*** 88,124 ****
Page page;
char *data;
- /*
- * We need to acquire and hold lock on target page while updating the left
- * child page. If we have a full-page image of target page, getting the
- * lock is a side-effect of restoring that image. Note that even if the
- * target page no longer exists, we'll still attempt to replay the change
- * on the child page.
- */
- if (record->xl_info & XLR_BKP_BLOCK(0))
- buffer = RestoreBackupBlock(lsn, record, 0, false, true);
- else
- buffer = XLogReadBuffer(xldata->node, xldata->blkno, false);
-
- /* Fix follow-right data on left child page */
if (BlockNumberIsValid(xldata->leftchild))
! gistRedoClearFollowRight(lsn, record, 1,
! xldata->node, xldata->leftchild);
!
! /* Done if target page no longer exists */
! if (!BufferIsValid(buffer))
! return;
/* nothing more to do if page was backed up (and no info to do it with) */
! if (record->xl_info & XLR_BKP_BLOCK(0))
! {
! UnlockReleaseBuffer(buffer);
return;
- }
page = (Page) BufferGetPage(buffer);
- /* nothing more to do if change already applied */
if (XLByteLE(lsn, PageGetLSN(page)))
{
UnlockReleaseBuffer(buffer);
--- 75,92 ----
Page page;
char *data;
if (BlockNumberIsValid(xldata->leftchild))
! gistRedoClearFollowRight(xldata->node, lsn, xldata->leftchild);
/* nothing more to do if page was backed up (and no info to do it with) */
! if (record->xl_info & XLR_BKP_BLOCK_1)
return;
+ buffer = XLogReadBuffer(xldata->node, xldata->blkno, false);
+ if (!BufferIsValid(buffer))
+ return;
page = (Page) BufferGetPage(buffer);
if (XLByteLE(lsn, PageGetLSN(page)))
{
UnlockReleaseBuffer(buffer);
*************** gistRedoPageUpdateRecord(XLogRecPtr lsn,
*** 172,187 ****
GistClearTuplesDeleted(page);
}
! if (!GistPageIsLeaf(page) &&
! PageGetMaxOffsetNumber(page) == InvalidOffsetNumber &&
! xldata->blkno == GIST_ROOT_BLKNO)
! {
/*
* all links on non-leaf root page was deleted by vacuum full, so root
* page becomes a leaf
*/
GistPageSetLeaf(page);
- }
GistPageGetOpaque(page)->rightlink = InvalidBlockNumber;
PageSetLSN(page, lsn);
--- 140,152 ----
GistClearTuplesDeleted(page);
}
! if (!GistPageIsLeaf(page) && PageGetMaxOffsetNumber(page) == InvalidOffsetNumber && xldata->blkno == GIST_ROOT_BLKNO)
!
/*
* all links on non-leaf root page were deleted by vacuum full, so root
* page becomes a leaf
*/
GistPageSetLeaf(page);
GistPageGetOpaque(page)->rightlink = InvalidBlockNumber;
PageSetLSN(page, lsn);
*************** gistRedoPageUpdateRecord(XLogRecPtr lsn,
*** 191,196 ****
--- 156,185 ----
}
static void
+ gistRedoPageDeleteRecord(XLogRecPtr lsn, XLogRecord *record)
+ {
+ gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+ Buffer buffer;
+ Page page;
+
+ /* nothing else to do if page was backed up (and no info to do it with) */
+ if (record->xl_info & XLR_BKP_BLOCK_1)
+ return;
+
+ buffer = XLogReadBuffer(xldata->node, xldata->blkno, false);
+ if (!BufferIsValid(buffer))
+ return;
+
+ page = (Page) BufferGetPage(buffer);
+ GistPageSetDeleted(page);
+
+ PageSetLSN(page, lsn);
+ PageSetTLI(page, ThisTimeLineID);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+ }
+
+ static void
decodePageSplitRecord(PageSplitRecord *decoded, XLogRecord *record)
{
char *begin = XLogRecGetData(record),
*************** gistRedoPageSplitRecord(XLogRecPtr lsn,
*** 226,247 ****
{
gistxlogPageSplit *xldata = (gistxlogPageSplit *) XLogRecGetData(record);
PageSplitRecord xlrec;
- Buffer firstbuffer = InvalidBuffer;
Buffer buffer;
Page page;
int i;
bool isrootsplit = false;
decodePageSplitRecord(&xlrec, record);
- /*
- * We must hold lock on the first-listed page throughout the action,
- * including while updating the left child page (if any). We can unlock
- * remaining pages in the list as soon as they've been written, because
- * there is no path for concurrent queries to reach those pages without
- * first visiting the first-listed page.
- */
-
/* loop around all pages */
for (i = 0; i < xlrec.data->npage; i++)
{
--- 215,229 ----
{
gistxlogPageSplit *xldata = (gistxlogPageSplit *) XLogRecGetData(record);
PageSplitRecord xlrec;
Buffer buffer;
Page page;
int i;
bool isrootsplit = false;
+ if (BlockNumberIsValid(xldata->leftchild))
+ gistRedoClearFollowRight(xldata->node, lsn, xldata->leftchild);
decodePageSplitRecord(&xlrec, record);
/* loop around all pages */
for (i = 0; i < xlrec.data->npage; i++)
{
*************** gistRedoPageSplitRecord(XLogRecPtr lsn,
*** 291,310 ****
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
MarkBufferDirty(buffer);
!
! if (i == 0)
! firstbuffer = buffer;
! else
! UnlockReleaseBuffer(buffer);
}
-
- /* Fix follow-right data on left child page, if any */
- if (BlockNumberIsValid(xldata->leftchild))
- gistRedoClearFollowRight(lsn, record, 0,
- xldata->node, xldata->leftchild);
-
- /* Finally, release lock on the first page */
- UnlockReleaseBuffer(firstbuffer);
}
static void
--- 273,280 ----
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
MarkBufferDirty(buffer);
! UnlockReleaseBuffer(buffer);
}
}
static void
*************** gistRedoCreateIndex(XLogRecPtr lsn, XLog
*** 314,322 ****
Buffer buffer;
Page page;
- /* Backup blocks are not used in create_index records */
- Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
-
buffer = XLogReadBuffer(*node, GIST_ROOT_BLKNO, true);
Assert(BufferIsValid(buffer));
page = (Page) BufferGetPage(buffer);
--- 284,289 ----
*************** gist_redo(XLogRecPtr lsn, XLogRecord *re
*** 341,346 ****
--- 308,314 ----
* implement a similar optimization we have in b-tree, and remove killed
* tuples outside VACUUM, we'll need to handle that here.
*/
+ RestoreBkpBlocks(lsn, record, false);
oldCxt = MemoryContextSwitchTo(opCtx);
switch (info)
*************** gist_redo(XLogRecPtr lsn, XLogRecord *re
*** 348,353 ****
--- 316,324 ----
case XLOG_GIST_PAGE_UPDATE:
gistRedoPageUpdateRecord(lsn, record);
break;
+ case XLOG_GIST_PAGE_DELETE:
+ gistRedoPageDeleteRecord(lsn, record);
+ break;
case XLOG_GIST_PAGE_SPLIT:
gistRedoPageSplitRecord(lsn, record);
break;
*************** out_gistxlogPageUpdate(StringInfo buf, g
*** 377,382 ****
--- 348,361 ----
}
static void
+ out_gistxlogPageDelete(StringInfo buf, gistxlogPageDelete *xlrec)
+ {
+ appendStringInfo(buf, "page_delete: rel %u/%u/%u; blkno %u",
+ xlrec->node.spcNode, xlrec->node.dbNode, xlrec->node.relNode,
+ xlrec->blkno);
+ }
+
+ static void
out_gistxlogPageSplit(StringInfo buf, gistxlogPageSplit *xlrec)
{
appendStringInfo(buf, "page_split: ");
*************** gist_desc(StringInfo buf, uint8 xl_info,
*** 396,401 ****
--- 375,383 ----
appendStringInfo(buf, "page_update: ");
out_gistxlogPageUpdate(buf, (gistxlogPageUpdate *) rec);
break;
+ case XLOG_GIST_PAGE_DELETE:
+ out_gistxlogPageDelete(buf, (gistxlogPageDelete *) rec);
+ break;
case XLOG_GIST_PAGE_SPLIT:
out_gistxlogPageSplit(buf, (gistxlogPageSplit *) rec);
break;
*************** gistXLogUpdate(RelFileNode node, Buffer
*** 516,545 ****
Buffer leftchildbuf)
{
XLogRecData *rdata;
! gistxlogPageUpdate xlrec;
int cur,
i;
XLogRecPtr recptr;
! rdata = (XLogRecData *) palloc(sizeof(XLogRecData) * (3 + ituplen));
! xlrec.node = node;
! xlrec.blkno = BufferGetBlockNumber(buffer);
! xlrec.ntodelete = ntodelete;
! xlrec.leftchild =
BufferIsValid(leftchildbuf) ? BufferGetBlockNumber(leftchildbuf) : InvalidBlockNumber;
! rdata[0].data = (char *) &xlrec;
! rdata[0].len = sizeof(gistxlogPageUpdate);
! rdata[0].buffer = InvalidBuffer;
rdata[0].next = &(rdata[1]);
! rdata[1].data = (char *) todelete;
! rdata[1].len = sizeof(OffsetNumber) * ntodelete;
! rdata[1].buffer = buffer;
! rdata[1].buffer_std = true;
! cur = 2;
/* new tuples */
for (i = 0; i < ituplen; i++)
--- 498,534 ----
Buffer leftchildbuf)
{
XLogRecData *rdata;
! gistxlogPageUpdate *xlrec;
int cur,
i;
XLogRecPtr recptr;
! rdata = (XLogRecData *) palloc(sizeof(XLogRecData) * (4 + ituplen));
! xlrec = (gistxlogPageUpdate *) palloc(sizeof(gistxlogPageUpdate));
! xlrec->node = node;
! xlrec->blkno = BufferGetBlockNumber(buffer);
! xlrec->ntodelete = ntodelete;
! xlrec->leftchild =
BufferIsValid(leftchildbuf) ? BufferGetBlockNumber(leftchildbuf) : InvalidBlockNumber;
! rdata[0].buffer = buffer;
! rdata[0].buffer_std = true;
! rdata[0].data = NULL;
! rdata[0].len = 0;
rdata[0].next = &(rdata[1]);
! rdata[1].data = (char *) xlrec;
! rdata[1].len = sizeof(gistxlogPageUpdate);
! rdata[1].buffer = InvalidBuffer;
! rdata[1].next = &(rdata[2]);
! rdata[2].data = (char *) todelete;
! rdata[2].len = sizeof(OffsetNumber) * ntodelete;
! rdata[2].buffer = buffer;
! rdata[2].buffer_std = true;
!
! cur = 3;
/* new tuples */
for (i = 0; i < ituplen; i++)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
new file mode 100644
index 64aecf2..570cf95
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
*************** heap_xlog_cleanup_info(XLogRecPtr lsn, X
*** 4620,4628 ****
* conflict processing to occur before we begin index vacuum actions. see
* vacuumlazy.c and also comments in btvacuumpage()
*/
-
- /* Backup blocks are not used in cleanup_info records */
- Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
}
/*
--- 4620,4625 ----
*************** heap_xlog_clean(XLogRecPtr lsn, XLogReco
*** 4655,4669 ****
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
xlrec->node);
! /*
! * If we have a full-page image, restore it (using a cleanup lock) and
! * we're done.
! */
! if (record->xl_info & XLR_BKP_BLOCK(0))
! {
! (void) RestoreBackupBlock(lsn, record, 0, true, false);
return;
- }
buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
if (!BufferIsValid(buffer))
--- 4652,4661 ----
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
xlrec->node);
! RestoreBkpBlocks(lsn, record, true);
!
! if (record->xl_info & XLR_BKP_BLOCK_1)
return;
buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
if (!BufferIsValid(buffer))
*************** heap_xlog_freeze(XLogRecPtr lsn, XLogRec
*** 4729,4744 ****
if (InHotStandby)
ResolveRecoveryConflictWithSnapshot(cutoff_xid, xlrec->node);
! /* If we have a full-page image, restore it and we're done */
! if (record->xl_info & XLR_BKP_BLOCK(0))
! {
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
return;
- }
! buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
if (!BufferIsValid(buffer))
return;
page = (Page) BufferGetPage(buffer);
if (XLByteLE(lsn, PageGetLSN(page)))
--- 4721,4735 ----
if (InHotStandby)
ResolveRecoveryConflictWithSnapshot(cutoff_xid, xlrec->node);
! RestoreBkpBlocks(lsn, record, false);
!
! if (record->xl_info & XLR_BKP_BLOCK_1)
return;
! buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
if (!BufferIsValid(buffer))
return;
+ LockBufferForCleanup(buffer);
page = (Page) BufferGetPage(buffer);
if (XLByteLE(lsn, PageGetLSN(page)))
*************** heap_xlog_visible(XLogRecPtr lsn, XLogRe
*** 4788,4793 ****
--- 4779,4796 ----
Page page;
/*
+ * Read the heap page, if it still exists. If the heap file has been
+ * dropped or truncated later in recovery, this might fail. In that case,
+ * there's no point in doing anything further, since the visibility map
+ * will have to be cleared out at the same time.
+ */
+ buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block,
+ RBM_NORMAL);
+ if (!BufferIsValid(buffer))
+ return;
+ page = (Page) BufferGetPage(buffer);
+
+ /*
* If there are any Hot Standby transactions running that have an xmin
* horizon old enough that this page isn't all-visible for them, they
* might incorrectly decide that an index-only scan can skip a heap fetch.
*************** heap_xlog_visible(XLogRecPtr lsn, XLogRe
*** 4799,4848 ****
if (InHotStandby)
ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid, xlrec->node);
/*
! * Read the heap page, if it still exists. If the heap file has been
! * dropped or truncated later in recovery, we don't need to update the
! * page, but we'd better still update the visibility map.
*/
! buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block,
! RBM_NORMAL);
! if (BufferIsValid(buffer))
{
! LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
!
! page = (Page) BufferGetPage(buffer);
!
! /*
! * We don't bump the LSN of the heap page when setting the visibility
! * map bit, because that would generate an unworkable volume of
! * full-page writes. This exposes us to torn page hazards, but since
! * we're not inspecting the existing page contents in any way, we
! * don't care.
! *
! * However, all operations that clear the visibility map bit *do* bump
! * the LSN, and those operations will only be replayed if the XLOG LSN
! * follows the page LSN. Thus, if the page LSN has advanced past our
! * XLOG record's LSN, we mustn't mark the page all-visible, because
! * the subsequent update won't be replayed to clear the flag.
! */
! if (!XLByteLE(lsn, PageGetLSN(page)))
! {
! PageSetAllVisible(page);
! MarkBufferDirty(buffer);
! }
!
! /* Done with heap page. */
! UnlockReleaseBuffer(buffer);
}
/*
! * Even if we skipped the heap page update due to the LSN interlock, it's
* still safe to update the visibility map. Any WAL record that clears
* the visibility map bit does so before checking the page LSN, so any
* bits that need to be cleared will still be cleared.
*/
! if (record->xl_info & XLR_BKP_BLOCK(0))
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
else
{
Relation reln;
--- 4802,4838 ----
if (InHotStandby)
ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid, xlrec->node);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
/*
! * We don't bump the LSN of the heap page when setting the visibility map
! * bit, because that would generate an unworkable volume of full-page
! * writes. This exposes us to torn page hazards, but since we're not
! * inspecting the existing page contents in any way, we don't care.
! *
! * However, all operations that clear the visibility map bit *do* bump the
! * LSN, and those operations will only be replayed if the XLOG LSN follows
! * the page LSN. Thus, if the page LSN has advanced past our XLOG
! * record's LSN, we mustn't mark the page all-visible, because the
! * subsequent update won't be replayed to clear the flag.
*/
! if (!XLByteLE(lsn, PageGetLSN(page)))
{
! PageSetAllVisible(page);
! MarkBufferDirty(buffer);
}
+ /* Done with heap page. */
+ UnlockReleaseBuffer(buffer);
+
/*
! * Even if we skipped the heap page update due to the LSN interlock, it's
* still safe to update the visibility map. Any WAL record that clears
* the visibility map bit does so before checking the page LSN, so any
* bits that need to be cleared will still be cleared.
*/
! if (record->xl_info & XLR_BKP_BLOCK_1)
! RestoreBkpBlocks(lsn, record, false);
else
{
Relation reln;
*************** heap_xlog_visible(XLogRecPtr lsn, XLogRe
*** 4854,4866 ****
/*
* Don't set the bit if replay has already passed this point.
*
! * It might be safe to do this unconditionally; if replay has passed
* this point, we'll replay at least as far this time as we did
* before, and if this bit needs to be cleared, the record responsible
* for doing so should be again replayed, and clear it. For right
* now, out of an abundance of conservatism, we use the same test here
! * we did for the heap page. If this results in a dropped bit, no
! * real harm is done; and the next VACUUM will fix it.
*/
if (!XLByteLE(lsn, PageGetLSN(BufferGetPage(vmbuffer))))
visibilitymap_set(reln, xlrec->block, lsn, vmbuffer,
--- 4844,4856 ----
/*
* Don't set the bit if replay has already passed this point.
*
! * It might be safe to do this unconditionally; if replay has passed
* this point, we'll replay at least as far this time as we did
* before, and if this bit needs to be cleared, the record responsible
* for doing so should be again replayed, and clear it. For right
* now, out of an abundance of conservatism, we use the same test here
! * we did for the heap page; if this results in a dropped bit, no real
! * harm is done; and the next VACUUM will fix it.
*/
if (!XLByteLE(lsn, PageGetLSN(BufferGetPage(vmbuffer))))
visibilitymap_set(reln, xlrec->block, lsn, vmbuffer,
*************** heap_xlog_newpage(XLogRecPtr lsn, XLogRe
*** 4878,4886 ****
Buffer buffer;
Page page;
- /* Backup blocks are not used in newpage records */
- Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
-
/*
* Note: the NEWPAGE log record is used for both heaps and indexes, so do
* not do anything that assumes we are touching a heap.
--- 4868,4873 ----
*************** heap_xlog_delete(XLogRecPtr lsn, XLogRec
*** 4936,4947 ****
FreeFakeRelcacheEntry(reln);
}
! /* If we have a full-page image, restore it and we're done */
! if (record->xl_info & XLR_BKP_BLOCK(0))
! {
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
return;
- }
buffer = XLogReadBuffer(xlrec->target.node, blkno, false);
if (!BufferIsValid(buffer))
--- 4923,4930 ----
FreeFakeRelcacheEntry(reln);
}
! if (record->xl_info & XLR_BKP_BLOCK_1)
return;
buffer = XLogReadBuffer(xlrec->target.node, blkno, false);
if (!BufferIsValid(buffer))
*************** heap_xlog_insert(XLogRecPtr lsn, XLogRec
*** 5021,5032 ****
FreeFakeRelcacheEntry(reln);
}
! /* If we have a full-page image, restore it and we're done */
! if (record->xl_info & XLR_BKP_BLOCK(0))
! {
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
return;
- }
if (record->xl_info & XLOG_HEAP_INIT_PAGE)
{
--- 5004,5011 ----
FreeFakeRelcacheEntry(reln);
}
! if (record->xl_info & XLR_BKP_BLOCK_1)
return;
if (record->xl_info & XLOG_HEAP_INIT_PAGE)
{
*************** heap_xlog_multi_insert(XLogRecPtr lsn, X
*** 5128,5133 ****
--- 5107,5114 ----
* required.
*/
+ RestoreBkpBlocks(lsn, record, false);
+
xlrec = (xl_heap_multi_insert *) recdata;
recdata += SizeOfHeapMultiInsert;
*************** heap_xlog_multi_insert(XLogRecPtr lsn, X
*** 5156,5167 ****
FreeFakeRelcacheEntry(reln);
}
! /* If we have a full-page image, restore it and we're done */
! if (record->xl_info & XLR_BKP_BLOCK(0))
! {
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
return;
- }
if (isinit)
{
--- 5137,5144 ----
FreeFakeRelcacheEntry(reln);
}
! if (record->xl_info & XLR_BKP_BLOCK_1)
return;
if (isinit)
{
*************** static void
*** 5255,5264 ****
heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
{
xl_heap_update *xlrec = (xl_heap_update *) XLogRecGetData(record);
bool samepage = (ItemPointerGetBlockNumber(&(xlrec->newtid)) ==
ItemPointerGetBlockNumber(&(xlrec->target.tid)));
- Buffer obuffer,
- nbuffer;
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
--- 5232,5240 ----
heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
{
xl_heap_update *xlrec = (xl_heap_update *) XLogRecGetData(record);
+ Buffer buffer;
bool samepage = (ItemPointerGetBlockNumber(&(xlrec->newtid)) ==
ItemPointerGetBlockNumber(&(xlrec->target.tid)));
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
*************** heap_xlog_update(XLogRecPtr lsn, XLogRec
*** 5289,5332 ****
FreeFakeRelcacheEntry(reln);
}
! /*
! * In normal operation, it is important to lock the two pages in
! * page-number order, to avoid possible deadlocks against other update
! * operations going the other way. However, during WAL replay there can
! * be no other update happening, so we don't need to worry about that. But
! * we *do* need to worry that we don't expose an inconsistent state to Hot
! * Standby queries --- so the original page can't be unlocked before we've
! * added the new tuple to the new page.
! */
!
! if (record->xl_info & XLR_BKP_BLOCK(0))
{
- obuffer = RestoreBackupBlock(lsn, record, 0, false, true);
if (samepage)
! {
! /* backup block covered both changes, so we're done */
! UnlockReleaseBuffer(obuffer);
! return;
! }
goto newt;
}
/* Deal with old tuple version */
! obuffer = XLogReadBuffer(xlrec->target.node,
! ItemPointerGetBlockNumber(&(xlrec->target.tid)),
! false);
! if (!BufferIsValid(obuffer))
goto newt;
! page = (Page) BufferGetPage(obuffer);
if (XLByteLE(lsn, PageGetLSN(page))) /* changes are applied */
{
if (samepage)
- {
- UnlockReleaseBuffer(obuffer);
return;
- }
goto newt;
}
--- 5265,5291 ----
FreeFakeRelcacheEntry(reln);
}
! if (record->xl_info & XLR_BKP_BLOCK_1)
{
if (samepage)
! return; /* backup block covered both changes */
goto newt;
}
/* Deal with old tuple version */
! buffer = XLogReadBuffer(xlrec->target.node,
! ItemPointerGetBlockNumber(&(xlrec->target.tid)),
! false);
! if (!BufferIsValid(buffer))
goto newt;
! page = (Page) BufferGetPage(buffer);
if (XLByteLE(lsn, PageGetLSN(page))) /* changes are applied */
{
+ UnlockReleaseBuffer(buffer);
if (samepage)
return;
goto newt;
}
*************** heap_xlog_update(XLogRecPtr lsn, XLogRec
*** 5364,5377 ****
* is already applied
*/
if (samepage)
- {
- nbuffer = obuffer;
goto newsame;
- }
-
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
! MarkBufferDirty(obuffer);
/* Deal with new tuple */
--- 5323,5333 ----
* is already applied
*/
if (samepage)
goto newsame;
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
! MarkBufferDirty(buffer);
! UnlockReleaseBuffer(buffer);
/* Deal with new tuple */
*************** newt:;
*** 5393,5430 ****
FreeFakeRelcacheEntry(reln);
}
! if (record->xl_info & XLR_BKP_BLOCK(1))
! {
! (void) RestoreBackupBlock(lsn, record, 1, false, false);
! if (BufferIsValid(obuffer))
! UnlockReleaseBuffer(obuffer);
return;
- }
if (record->xl_info & XLOG_HEAP_INIT_PAGE)
{
! nbuffer = XLogReadBuffer(xlrec->target.node,
! ItemPointerGetBlockNumber(&(xlrec->newtid)),
! true);
! Assert(BufferIsValid(nbuffer));
! page = (Page) BufferGetPage(nbuffer);
! PageInit(page, BufferGetPageSize(nbuffer), 0);
}
else
{
! nbuffer = XLogReadBuffer(xlrec->target.node,
! ItemPointerGetBlockNumber(&(xlrec->newtid)),
! false);
! if (!BufferIsValid(nbuffer))
return;
! page = (Page) BufferGetPage(nbuffer);
if (XLByteLE(lsn, PageGetLSN(page))) /* changes are applied */
{
! UnlockReleaseBuffer(nbuffer);
! if (BufferIsValid(obuffer))
! UnlockReleaseBuffer(obuffer);
return;
}
}
--- 5349,5379 ----
FreeFakeRelcacheEntry(reln);
}
! if (record->xl_info & XLR_BKP_BLOCK_2)
return;
if (record->xl_info & XLOG_HEAP_INIT_PAGE)
{
! buffer = XLogReadBuffer(xlrec->target.node,
! ItemPointerGetBlockNumber(&(xlrec->newtid)),
! true);
! Assert(BufferIsValid(buffer));
! page = (Page) BufferGetPage(buffer);
! PageInit(page, BufferGetPageSize(buffer), 0);
}
else
{
! buffer = XLogReadBuffer(xlrec->target.node,
! ItemPointerGetBlockNumber(&(xlrec->newtid)),
! false);
! if (!BufferIsValid(buffer))
return;
! page = (Page) BufferGetPage(buffer);
if (XLByteLE(lsn, PageGetLSN(page))) /* changes are applied */
{
! UnlockReleaseBuffer(buffer);
return;
}
}
*************** newsame:;
*** 5469,5482 ****
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
! MarkBufferDirty(nbuffer);
! UnlockReleaseBuffer(nbuffer);
!
! if (BufferIsValid(obuffer) && obuffer != nbuffer)
! UnlockReleaseBuffer(obuffer);
/*
! * If the new page is running low on free space, update the FSM as well.
* Arbitrarily, our definition of "low" is less than 20%. We can't do much
* better than that without knowing the fill-factor for the table.
*
--- 5418,5428 ----
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
! MarkBufferDirty(buffer);
! UnlockReleaseBuffer(buffer);
/*
! * If the page is running low on free space, update the FSM as well.
* Arbitrarily, our definition of "low" is less than 20%. We can't do much
* better than that without knowing the fill-factor for the table.
*
*************** newsame:;
*** 5492,5499 ****
*/
if (!hot_update && freespace < BLCKSZ / 5)
XLogRecordPageWithFreeSpace(xlrec->target.node,
! ItemPointerGetBlockNumber(&(xlrec->newtid)),
! freespace);
}
static void
--- 5438,5444 ----
*/
if (!hot_update && freespace < BLCKSZ / 5)
XLogRecordPageWithFreeSpace(xlrec->target.node,
! ItemPointerGetBlockNumber(&(xlrec->newtid)), freespace);
}
static void
*************** heap_xlog_lock(XLogRecPtr lsn, XLogRecor
*** 5506,5517 ****
ItemId lp = NULL;
HeapTupleHeader htup;
! /* If we have a full-page image, restore it and we're done */
! if (record->xl_info & XLR_BKP_BLOCK(0))
! {
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
return;
- }
buffer = XLogReadBuffer(xlrec->target.node,
ItemPointerGetBlockNumber(&(xlrec->target.tid)),
--- 5451,5458 ----
ItemId lp = NULL;
HeapTupleHeader htup;
! if (record->xl_info & XLR_BKP_BLOCK_1)
return;
buffer = XLogReadBuffer(xlrec->target.node,
ItemPointerGetBlockNumber(&(xlrec->target.tid)),
*************** heap_xlog_inplace(XLogRecPtr lsn, XLogRe
*** 5569,5580 ****
uint32 oldlen;
uint32 newlen;
! /* If we have a full-page image, restore it and we're done */
! if (record->xl_info & XLR_BKP_BLOCK(0))
! {
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
return;
- }
buffer = XLogReadBuffer(xlrec->target.node,
ItemPointerGetBlockNumber(&(xlrec->target.tid)),
--- 5510,5517 ----
uint32 oldlen;
uint32 newlen;
! if (record->xl_info & XLR_BKP_BLOCK_1)
return;
buffer = XLogReadBuffer(xlrec->target.node,
ItemPointerGetBlockNumber(&(xlrec->target.tid)),
*************** heap_redo(XLogRecPtr lsn, XLogRecord *re
*** 5623,5628 ****
--- 5560,5567 ----
* required. The ones in heap2 rmgr do.
*/
+ RestoreBkpBlocks(lsn, record, false);
+
switch (info & XLOG_HEAP_OPMASK)
{
case XLOG_HEAP_INSERT:
*************** heap2_redo(XLogRecPtr lsn, XLogRecord *r
*** 5656,5661 ****
--- 5595,5605 ----
{
uint8 info = record->xl_info & ~XLR_INFO_MASK;
+ /*
+ * Note that RestoreBkpBlocks() is called after conflict processing within
+ * each record type handling function.
+ */
+
switch (info & XLOG_HEAP_OPMASK)
{
case XLOG_HEAP2_FREEZE:
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
new file mode 100644
index 8f53480..72ea171
*** a/src/backend/access/nbtree/nbtxlog.c
--- b/src/backend/access/nbtree/nbtxlog.c
*************** btree_xlog_insert(bool isleaf, bool isme
*** 218,226 ****
datalen -= sizeof(xl_btree_metadata);
}
! if (record->xl_info & XLR_BKP_BLOCK(0))
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
! else
{
buffer = XLogReadBuffer(xlrec->target.node,
ItemPointerGetBlockNumber(&(xlrec->target.tid)),
--- 218,227 ----
datalen -= sizeof(xl_btree_metadata);
}
! if ((record->xl_info & XLR_BKP_BLOCK_1) && !ismeta && isleaf)
! return; /* nothing to do */
!
! if (!(record->xl_info & XLR_BKP_BLOCK_1))
{
buffer = XLogReadBuffer(xlrec->target.node,
ItemPointerGetBlockNumber(&(xlrec->target.tid)),
*************** btree_xlog_insert(bool isleaf, bool isme
*** 248,260 ****
}
}
- /*
- * Note: in normal operation, we'd update the metapage while still holding
- * lock on the page we inserted into. But during replay it's not
- * necessary to hold that lock, since no other index updates can be
- * happening concurrently, and readers will cope fine with following an
- * obsolete link from the metapage.
- */
if (ismeta)
_bt_restore_meta(xlrec->target.node, lsn,
md.root, md.level,
--- 249,254 ----
*************** btree_xlog_split(bool onleft, bool isroo
*** 296,302 ****
forget_matching_split(xlrec->node, downlink, false);
/* Extract left hikey and its size (still assuming 16-bit alignment) */
! if (!(record->xl_info & XLR_BKP_BLOCK(0)))
{
/* We assume 16-bit alignment is enough for IndexTupleSize */
left_hikey = (Item) datapos;
--- 290,296 ----
forget_matching_split(xlrec->node, downlink, false);
/* Extract left hikey and its size (still assuming 16-bit alignment) */
! if (!(record->xl_info & XLR_BKP_BLOCK_1))
{
/* We assume 16-bit alignment is enough for IndexTupleSize */
left_hikey = (Item) datapos;
*************** btree_xlog_split(bool onleft, bool isroo
*** 316,322 ****
datalen -= sizeof(OffsetNumber);
}
! if (onleft && !(record->xl_info & XLR_BKP_BLOCK(0)))
{
/*
* We assume that 16-bit alignment is enough to apply IndexTupleSize
--- 310,316 ----
datalen -= sizeof(OffsetNumber);
}
! if (onleft && !(record->xl_info & XLR_BKP_BLOCK_1))
{
/*
* We assume that 16-bit alignment is enough to apply IndexTupleSize
*************** btree_xlog_split(bool onleft, bool isroo
*** 329,335 ****
datalen -= newitemsz;
}
! /* Reconstruct right (new) sibling page from scratch */
rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
Assert(BufferIsValid(rbuf));
rpage = (Page) BufferGetPage(rbuf);
--- 323,329 ----
datalen -= newitemsz;
}
! /* Reconstruct right (new) sibling from scratch */
rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
Assert(BufferIsValid(rbuf));
rpage = (Page) BufferGetPage(rbuf);
*************** btree_xlog_split(bool onleft, bool isroo
*** 363,383 ****
/* don't release the buffer yet; we touch right page's first item below */
! /* Now reconstruct left (original) sibling page */
! if (record->xl_info & XLR_BKP_BLOCK(0))
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
! else
{
Buffer lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);
if (BufferIsValid(lbuf))
{
- /*
- * Note that this code ensures that the items remaining on the
- * left page are in the correct item number order, but it does not
- * reproduce the physical order they would have had. Is this
- * worth changing? See also _bt_restore_page().
- */
Page lpage = (Page) BufferGetPage(lbuf);
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
--- 357,374 ----
/* don't release the buffer yet; we touch right page's first item below */
! /*
! * Reconstruct left (original) sibling if needed. Note that this code
! * ensures that the items remaining on the left page are in the correct
! * item number order, but it does not reproduce the physical order they
! * would have had. Is this worth changing? See also _bt_restore_page().
! */
! if (!(record->xl_info & XLR_BKP_BLOCK_1))
{
Buffer lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);
if (BufferIsValid(lbuf))
{
Page lpage = (Page) BufferGetPage(lbuf);
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
*************** btree_xlog_split(bool onleft, bool isroo
*** 441,457 ****
/* We no longer need the right buffer */
UnlockReleaseBuffer(rbuf);
! /*
! * Fix left-link of the page to the right of the new right sibling.
! *
! * Note: in normal operation, we do this while still holding lock on the
! * two split pages. However, that's not necessary for correctness in WAL
! * replay, because no other index update can be in progress, and readers
! * will cope properly when following an obsolete left-link.
! */
! if (record->xl_info & XLR_BKP_BLOCK(1))
! (void) RestoreBackupBlock(lsn, record, 1, false, false);
! else if (xlrec->rnext != P_NONE)
{
Buffer buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);
--- 432,439 ----
/* We no longer need the right buffer */
UnlockReleaseBuffer(rbuf);
! /* Fix left-link of the page to the right of the new right sibling */
! if (xlrec->rnext != P_NONE && !(record->xl_info & XLR_BKP_BLOCK_2))
{
Buffer buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);
*************** btree_xlog_split(bool onleft, bool isroo
*** 481,491 ****
static void
btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
{
! xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
Buffer buffer;
Page page;
BTPageOpaque opaque;
/*
* If queries might be active then we need to ensure every block is
* unpinned between the lastBlockVacuumed and the current block, if there
--- 463,475 ----
static void
btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
{
! xl_btree_vacuum *xlrec;
Buffer buffer;
Page page;
BTPageOpaque opaque;
+ xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+
/*
* If queries might be active then we need to ensure every block is
* unpinned between the lastBlockVacuumed and the current block, if there
*************** btree_xlog_vacuum(XLogRecPtr lsn, XLogRe
*** 518,531 ****
}
/*
! * If we have a full-page image, restore it (using a cleanup lock) and
! * we're done.
*/
! if (record->xl_info & XLR_BKP_BLOCK(0))
! {
! (void) RestoreBackupBlock(lsn, record, 0, true, false);
return;
- }
/*
* Like in btvacuumpage(), we need to take a cleanup lock on every leaf
--- 502,514 ----
}
/*
! * If the block was restored from a full page image, nothing more to do.
! * The RestoreBkpBlocks() call already pinned and took cleanup lock on it.
! * XXX: Perhaps we should call RestoreBkpBlocks() *after* the loop above,
! * to make the disk access more sequential.
*/
! if (record->xl_info & XLR_BKP_BLOCK_1)
return;
/*
* Like in btvacuumpage(), we need to take a cleanup lock on every leaf
*************** btree_xlog_vacuum(XLogRecPtr lsn, XLogRe
*** 580,587 ****
* XXX optimise later with something like XLogPrefetchBuffer()
*/
static TransactionId
! btree_xlog_delete_get_latestRemovedXid(xl_btree_delete *xlrec)
{
OffsetNumber *unused;
Buffer ibuffer,
hbuffer;
--- 563,571 ----
* XXX optimise later with something like XLogPrefetchBuffer()
*/
static TransactionId
! btree_xlog_delete_get_latestRemovedXid(XLogRecord *record)
{
+ xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
OffsetNumber *unused;
Buffer ibuffer,
hbuffer;
*************** btree_xlog_delete_get_latestRemovedXid(x
*** 718,752 ****
static void
btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
{
! xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
Buffer buffer;
Page page;
BTPageOpaque opaque;
! /*
! * If we have any conflict processing to do, it must happen before we
! * update the page.
! *
! * Btree delete records can conflict with standby queries. You might
! * think that vacuum records would conflict as well, but we've handled
! * that already. XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
! * cleaned by the vacuum of the heap and so we can resolve any conflicts
! * just once when that arrives. After that we know that no conflicts
! * exist from individual btree vacuum records on that index.
! */
! if (InHotStandby)
! {
! TransactionId latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(xlrec);
!
! ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node);
! }
!
! /* If we have a full-page image, restore it and we're done */
! if (record->xl_info & XLR_BKP_BLOCK(0))
! {
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
return;
! }
/*
* We don't need to take a cleanup lock to apply these changes. See
--- 702,716 ----
static void
btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
{
! xl_btree_delete *xlrec;
Buffer buffer;
Page page;
BTPageOpaque opaque;
! if (record->xl_info & XLR_BKP_BLOCK_1)
return;
!
! xlrec = (xl_btree_delete *) XLogRecGetData(record);
/*
* We don't need to take a cleanup lock to apply these changes. See
*************** btree_xlog_delete_page(uint8 info, XLogR
*** 802,819 ****
leftsib = xlrec->leftblk;
rightsib = xlrec->rightblk;
- /*
- * In normal operation, we would lock all the pages this WAL record
- * touches before changing any of them. In WAL replay, it should be okay
- * to lock just one page at a time, since no concurrent index updates can
- * be happening, and readers should not care whether they arrive at the
- * target page or not (since it's surely empty).
- */
-
/* parent page */
! if (record->xl_info & XLR_BKP_BLOCK(0))
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
! else
{
buffer = XLogReadBuffer(xlrec->target.node, parent, false);
if (BufferIsValid(buffer))
--- 766,773 ----
leftsib = xlrec->leftblk;
rightsib = xlrec->rightblk;
/* parent page */
! if (!(record->xl_info & XLR_BKP_BLOCK_1))
{
buffer = XLogReadBuffer(xlrec->target.node, parent, false);
if (BufferIsValid(buffer))
*************** btree_xlog_delete_page(uint8 info, XLogR
*** 859,867 ****
}
/* Fix left-link of right sibling */
! if (record->xl_info & XLR_BKP_BLOCK(1))
! (void) RestoreBackupBlock(lsn, record, 1, false, false);
! else
{
buffer = XLogReadBuffer(xlrec->target.node, rightsib, false);
if (BufferIsValid(buffer))
--- 813,819 ----
}
/* Fix left-link of right sibling */
! if (!(record->xl_info & XLR_BKP_BLOCK_2))
{
buffer = XLogReadBuffer(xlrec->target.node, rightsib, false);
if (BufferIsValid(buffer))
*************** btree_xlog_delete_page(uint8 info, XLogR
*** 885,893 ****
}
/* Fix right-link of left sibling, if any */
! if (record->xl_info & XLR_BKP_BLOCK(2))
! (void) RestoreBackupBlock(lsn, record, 2, false, false);
! else
{
if (leftsib != P_NONE)
{
--- 837,843 ----
}
/* Fix right-link of left sibling, if any */
! if (!(record->xl_info & XLR_BKP_BLOCK_3))
{
if (leftsib != P_NONE)
{
*************** btree_xlog_newroot(XLogRecPtr lsn, XLogR
*** 961,969 ****
BTPageOpaque pageop;
BlockNumber downlink = 0;
- /* Backup blocks are not used in newroot records */
- Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
-
buffer = XLogReadBuffer(xlrec->node, xlrec->rootblk, true);
Assert(BufferIsValid(buffer));
page = (Page) BufferGetPage(buffer);
--- 911,916 ----
*************** btree_xlog_newroot(XLogRecPtr lsn, XLogR
*** 1005,1040 ****
forget_matching_split(xlrec->node, downlink, true);
}
! static void
! btree_xlog_reuse_page(XLogRecPtr lsn, XLogRecord *record)
{
! xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) XLogRecGetData(record);
/*
! * Btree reuse_page records exist to provide a conflict point when we
! * reuse pages in the index via the FSM. That's all they do though.
! *
! * latestRemovedXid was the page's btpo.xact. The btpo.xact <
! * RecentGlobalXmin test in _bt_page_recyclable() conceptually mirrors the
! * pgxact->xmin > limitXmin test in GetConflictingVirtualXIDs().
! * Consequently, one XID value achieves the same exclusion effect on
! * master and standby.
*/
if (InHotStandby)
{
! ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
! xlrec->node);
! }
! /* Backup blocks are not used in reuse_page records */
! Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
! }
! void
! btree_redo(XLogRecPtr lsn, XLogRecord *record)
! {
! uint8 info = record->xl_info & ~XLR_INFO_MASK;
switch (info)
{
--- 952,1018 ----
forget_matching_split(xlrec->node, downlink, true);
}
!
! void
! btree_redo(XLogRecPtr lsn, XLogRecord *record)
{
! uint8 info = record->xl_info & ~XLR_INFO_MASK;
/*
! * If we have any conflict processing to do, it must happen before we
! * update the page.
*/
if (InHotStandby)
{
! switch (info)
! {
! case XLOG_BTREE_DELETE:
! /*
! * Btree delete records can conflict with standby queries. You
! * might think that vacuum records would conflict as well, but
! * we've handled that already. XLOG_HEAP2_CLEANUP_INFO records
! * provide the highest xid cleaned by the vacuum of the heap
! * and so we can resolve any conflicts just once when that
! * arrives. After that we know that no conflicts exist
! * from individual btree vacuum records on that index.
! */
! {
! TransactionId latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(record);
! xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
!
! ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node);
! }
! break;
+ case XLOG_BTREE_REUSE_PAGE:
! /*
! * Btree reuse page records exist to provide a conflict point
! * when we reuse pages in the index via the FSM. That's all they
! * do though. latestRemovedXid was the page's btpo.xact. The
! * btpo.xact < RecentGlobalXmin test in _bt_page_recyclable()
! * conceptually mirrors the pgxact->xmin > limitXmin test in
! * GetConflictingVirtualXIDs(). Consequently, one XID value
! * achieves the same exclusion effect on master and standby.
! */
! {
! xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) XLogRecGetData(record);
!
! ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
! }
! return;
!
! default:
! break;
! }
! }
!
! /*
! * Vacuum needs to pin and take cleanup lock on every leaf page, a regular
! * exclusive lock is enough for all other purposes.
! */
! RestoreBkpBlocks(lsn, record, (info == XLOG_BTREE_VACUUM));
switch (info)
{
*************** btree_redo(XLogRecPtr lsn, XLogRecord *r
*** 1074,1080 ****
btree_xlog_newroot(lsn, record);
break;
case XLOG_BTREE_REUSE_PAGE:
! btree_xlog_reuse_page(lsn, record);
break;
default:
elog(PANIC, "btree_redo: unknown op code %u", info);
--- 1052,1058 ----
btree_xlog_newroot(lsn, record);
break;
case XLOG_BTREE_REUSE_PAGE:
! /* Handled above before restoring bkp block */
break;
default:
elog(PANIC, "btree_redo: unknown op code %u", info);
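
For anyone reviewing the control-flow change above: after this patch, btree_redo() resolves any Hot Standby conflicts first, then calls RestoreBkpBlocks(), and only then dispatches the per-record replay work. The ordering matters because conflict processing must happen before any page is updated. Here is a minimal, self-contained model of that ordering; all names and types are stand-ins for illustration, not the real PostgreSQL symbols:

```c
#include <assert.h>
#include <stdbool.h>

/* Steps a redo routine performs, in the order this patch establishes. */
enum { STEP_CONFLICT = 1, STEP_RESTORE_BKP, STEP_APPLY };

static int call_log[3];
static int ncalls;

static void resolve_conflicts(void)  { call_log[ncalls++] = STEP_CONFLICT; }
static void restore_bkp_blocks(void) { call_log[ncalls++] = STEP_RESTORE_BKP; }
static void apply_record(void)       { call_log[ncalls++] = STEP_APPLY; }

/* Simplified analogue of btree_redo() after this patch. */
static void
btree_redo_model(bool in_hot_standby)
{
    ncalls = 0;
    if (in_hot_standby)
        resolve_conflicts();    /* must precede any page update */
    restore_bkp_blocks();       /* RestoreBkpBlocks() analogue */
    apply_record();             /* per-info-code replay work */
}
```

The model just pins down the invariant the comments describe: conflict resolution never runs after a backup block has been restored.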
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
new file mode 100644
index 8746b35..54e78f1
*** a/src/backend/access/spgist/spgxlog.c
--- b/src/backend/access/spgist/spgxlog.c
*************** spgRedoCreateIndex(XLogRecPtr lsn, XLogR
*** 76,84 ****
Buffer buffer;
Page page;
- /* Backup blocks are not used in create_index records */
- Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
-
buffer = XLogReadBuffer(*node, SPGIST_METAPAGE_BLKNO, true);
Assert(BufferIsValid(buffer));
page = (Page) BufferGetPage(buffer);
--- 76,81 ----
*************** spgRedoAddLeaf(XLogRecPtr lsn, XLogRecor
*** 120,133 ****
ptr += sizeof(spgxlogAddLeaf);
leafTuple = (SpGistLeafTuple) ptr;
! /*
! * In normal operation we would have both current and parent pages locked
! * simultaneously; but in WAL replay it should be safe to update the leaf
! * page before updating the parent.
! */
! if (record->xl_info & XLR_BKP_BLOCK(0))
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
! else
{
buffer = XLogReadBuffer(xldata->node, xldata->blknoLeaf,
xldata->newPage);
--- 117,123 ----
ptr += sizeof(spgxlogAddLeaf);
leafTuple = (SpGistLeafTuple) ptr;
! if (!(record->xl_info & XLR_BKP_BLOCK_1))
{
buffer = XLogReadBuffer(xldata->node, xldata->blknoLeaf,
xldata->newPage);
*************** spgRedoAddLeaf(XLogRecPtr lsn, XLogRecor
*** 179,187 ****
}
/* update parent downlink if necessary */
! if (record->xl_info & XLR_BKP_BLOCK(1))
! (void) RestoreBackupBlock(lsn, record, 1, false, false);
! else if (xldata->blknoParent != InvalidBlockNumber)
{
buffer = XLogReadBuffer(xldata->node, xldata->blknoParent, false);
if (BufferIsValid(buffer))
--- 169,176 ----
}
/* update parent downlink if necessary */
! if (xldata->blknoParent != InvalidBlockNumber &&
! !(record->xl_info & XLR_BKP_BLOCK_2))
{
buffer = XLogReadBuffer(xldata->node, xldata->blknoParent, false);
if (BufferIsValid(buffer))
*************** spgRedoMoveLeafs(XLogRecPtr lsn, XLogRec
*** 230,245 ****
/* now ptr points to the list of leaf tuples */
- /*
- * In normal operation we would have all three pages (source, dest, and
- * parent) locked simultaneously; but in WAL replay it should be safe to
- * update them one at a time, as long as we do it in the right order.
- */
-
/* Insert tuples on the dest page (do first, so redirect is valid) */
! if (record->xl_info & XLR_BKP_BLOCK(1))
! (void) RestoreBackupBlock(lsn, record, 1, false, false);
! else
{
buffer = XLogReadBuffer(xldata->node, xldata->blknoDst,
xldata->newPage);
--- 219,226 ----
/* now ptr points to the list of leaf tuples */
/* Insert tuples on the dest page (do first, so redirect is valid) */
! if (!(record->xl_info & XLR_BKP_BLOCK_2))
{
buffer = XLogReadBuffer(xldata->node, xldata->blknoDst,
xldata->newPage);
*************** spgRedoMoveLeafs(XLogRecPtr lsn, XLogRec
*** 272,280 ****
}
/* Delete tuples from the source page, inserting a redirection pointer */
! if (record->xl_info & XLR_BKP_BLOCK(0))
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
! else
{
buffer = XLogReadBuffer(xldata->node, xldata->blknoSrc, false);
if (BufferIsValid(buffer))
--- 253,259 ----
}
/* Delete tuples from the source page, inserting a redirection pointer */
! if (!(record->xl_info & XLR_BKP_BLOCK_1))
{
buffer = XLogReadBuffer(xldata->node, xldata->blknoSrc, false);
if (BufferIsValid(buffer))
*************** spgRedoMoveLeafs(XLogRecPtr lsn, XLogRec
*** 297,305 ****
}
/* And update the parent downlink */
! if (record->xl_info & XLR_BKP_BLOCK(2))
! (void) RestoreBackupBlock(lsn, record, 2, false, false);
! else
{
buffer = XLogReadBuffer(xldata->node, xldata->blknoParent, false);
if (BufferIsValid(buffer))
--- 276,282 ----
}
/* And update the parent downlink */
! if (!(record->xl_info & XLR_BKP_BLOCK_3))
{
buffer = XLogReadBuffer(xldata->node, xldata->blknoParent, false);
if (BufferIsValid(buffer))
*************** spgRedoAddNode(XLogRecPtr lsn, XLogRecor
*** 345,353 ****
{
/* update in place */
Assert(xldata->blknoParent == InvalidBlockNumber);
! if (record->xl_info & XLR_BKP_BLOCK(0))
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
! else
{
buffer = XLogReadBuffer(xldata->node, xldata->blkno, false);
if (BufferIsValid(buffer))
--- 322,328 ----
{
/* update in place */
Assert(xldata->blknoParent == InvalidBlockNumber);
! if (!(record->xl_info & XLR_BKP_BLOCK_1))
{
buffer = XLogReadBuffer(xldata->node, xldata->blkno, false);
if (BufferIsValid(buffer))
*************** spgRedoAddNode(XLogRecPtr lsn, XLogRecor
*** 372,393 ****
}
else
{
- /*
- * In normal operation we would have all three pages (source, dest,
- * and parent) locked simultaneously; but in WAL replay it should be
- * safe to update them one at a time, as long as we do it in the right
- * order.
- *
- * The logic here depends on the assumption that blkno != blknoNew,
- * else we can't tell which BKP bit goes with which page, and the LSN
- * checks could go wrong too.
- */
- Assert(xldata->blkno != xldata->blknoNew);
-
/* Install new tuple first so redirect is valid */
! if (record->xl_info & XLR_BKP_BLOCK(1))
! (void) RestoreBackupBlock(lsn, record, 1, false, false);
! else
{
buffer = XLogReadBuffer(xldata->node, xldata->blknoNew,
xldata->newPage);
--- 347,354 ----
}
else
{
/* Install new tuple first so redirect is valid */
! if (!(record->xl_info & XLR_BKP_BLOCK_2))
{
buffer = XLogReadBuffer(xldata->node, xldata->blknoNew,
xldata->newPage);
*************** spgRedoAddNode(XLogRecPtr lsn, XLogRecor
*** 404,420 ****
addOrReplaceTuple(page, (Item) innerTuple,
innerTuple->size, xldata->offnumNew);
! /*
! * If parent is in this same page, don't advance LSN;
! * doing so would fool us into not applying the parent
! * downlink update below. We'll update the LSN when we
! * fix the parent downlink.
! */
! if (xldata->blknoParent != xldata->blknoNew)
! {
! PageSetLSN(page, lsn);
! PageSetTLI(page, ThisTimeLineID);
! }
MarkBufferDirty(buffer);
}
UnlockReleaseBuffer(buffer);
--- 365,372 ----
addOrReplaceTuple(page, (Item) innerTuple,
innerTuple->size, xldata->offnumNew);
! PageSetLSN(page, lsn);
! PageSetTLI(page, ThisTimeLineID);
MarkBufferDirty(buffer);
}
UnlockReleaseBuffer(buffer);
*************** spgRedoAddNode(XLogRecPtr lsn, XLogRecor
*** 422,430 ****
}
/* Delete old tuple, replacing it with redirect or placeholder tuple */
! if (record->xl_info & XLR_BKP_BLOCK(0))
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
! else
{
buffer = XLogReadBuffer(xldata->node, xldata->blkno, false);
if (BufferIsValid(buffer))
--- 374,380 ----
}
/* Delete old tuple, replacing it with redirect or placeholder tuple */
! if (!(record->xl_info & XLR_BKP_BLOCK_1))
{
buffer = XLogReadBuffer(xldata->node, xldata->blkno, false);
if (BufferIsValid(buffer))
*************** spgRedoAddNode(XLogRecPtr lsn, XLogRecor
*** 455,471 ****
else
SpGistPageGetOpaque(page)->nRedirection++;
! /*
! * If parent is in this same page, don't advance LSN;
! * doing so would fool us into not applying the parent
! * downlink update below. We'll update the LSN when we
! * fix the parent downlink.
! */
! if (xldata->blknoParent != xldata->blkno)
! {
! PageSetLSN(page, lsn);
! PageSetTLI(page, ThisTimeLineID);
! }
MarkBufferDirty(buffer);
}
UnlockReleaseBuffer(buffer);
--- 405,412 ----
else
SpGistPageGetOpaque(page)->nRedirection++;
! PageSetLSN(page, lsn);
! PageSetTLI(page, ThisTimeLineID);
MarkBufferDirty(buffer);
}
UnlockReleaseBuffer(buffer);
*************** spgRedoAddNode(XLogRecPtr lsn, XLogRecor
*** 484,495 ****
else
bbi = 2;
! if (record->xl_info & XLR_BKP_BLOCK(bbi))
! {
! if (bbi == 2) /* else we already did it */
! (void) RestoreBackupBlock(lsn, record, bbi, false, false);
! }
! else
{
buffer = XLogReadBuffer(xldata->node, xldata->blknoParent, false);
if (BufferIsValid(buffer))
--- 425,431 ----
else
bbi = 2;
! if (!(record->xl_info & XLR_SET_BKP_BLOCK(bbi)))
{
buffer = XLogReadBuffer(xldata->node, xldata->blknoParent, false);
if (BufferIsValid(buffer))
*************** spgRedoSplitTuple(XLogRecPtr lsn, XLogRe
*** 531,546 ****
ptr += prefixTuple->size;
postfixTuple = (SpGistInnerTuple) ptr;
- /*
- * In normal operation we would have both pages locked simultaneously; but
- * in WAL replay it should be safe to update them one at a time, as long
- * as we do it in the right order.
- */
-
/* insert postfix tuple first to avoid dangling link */
! if (record->xl_info & XLR_BKP_BLOCK(1))
! (void) RestoreBackupBlock(lsn, record, 1, false, false);
! else if (xldata->blknoPostfix != xldata->blknoPrefix)
{
buffer = XLogReadBuffer(xldata->node, xldata->blknoPostfix,
xldata->newPage);
--- 467,475 ----
ptr += prefixTuple->size;
postfixTuple = (SpGistInnerTuple) ptr;
/* insert postfix tuple first to avoid dangling link */
! if (xldata->blknoPostfix != xldata->blknoPrefix &&
! !(record->xl_info & XLR_BKP_BLOCK_2))
{
buffer = XLogReadBuffer(xldata->node, xldata->blknoPostfix,
xldata->newPage);
*************** spgRedoSplitTuple(XLogRecPtr lsn, XLogRe
*** 566,574 ****
}
/* now handle the original page */
! if (record->xl_info & XLR_BKP_BLOCK(0))
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
! else
{
buffer = XLogReadBuffer(xldata->node, xldata->blknoPrefix, false);
if (BufferIsValid(buffer))
--- 495,501 ----
}
/* now handle the original page */
! if (!(record->xl_info & XLR_BKP_BLOCK_1))
{
buffer = XLogReadBuffer(xldata->node, xldata->blknoPrefix, false);
if (BufferIsValid(buffer))
*************** spgRedoPickSplit(XLogRecPtr lsn, XLogRec
*** 608,615 ****
uint8 *leafPageSelect;
Buffer srcBuffer;
Buffer destBuffer;
- Page srcPage;
- Page destPage;
Page page;
int bbi;
int i;
--- 535,540 ----
*************** spgRedoPickSplit(XLogRecPtr lsn, XLogRec
*** 638,651 ****
{
/* when splitting root, we touch it only in the guise of new inner */
srcBuffer = InvalidBuffer;
- srcPage = NULL;
}
else if (xldata->initSrc)
{
/* just re-init the source page */
srcBuffer = XLogReadBuffer(xldata->node, xldata->blknoSrc, true);
Assert(BufferIsValid(srcBuffer));
! srcPage = (Page) BufferGetPage(srcBuffer);
SpGistInitBuffer(srcBuffer,
SPGIST_LEAF | (xldata->storesNulls ? SPGIST_NULLS : 0));
--- 563,575 ----
{
/* when splitting root, we touch it only in the guise of new inner */
srcBuffer = InvalidBuffer;
}
else if (xldata->initSrc)
{
/* just re-init the source page */
srcBuffer = XLogReadBuffer(xldata->node, xldata->blknoSrc, true);
Assert(BufferIsValid(srcBuffer));
! page = (Page) BufferGetPage(srcBuffer);
SpGistInitBuffer(srcBuffer,
SPGIST_LEAF | (xldata->storesNulls ? SPGIST_NULLS : 0));
*************** spgRedoPickSplit(XLogRecPtr lsn, XLogRec
*** 653,676 ****
}
else
{
! /*
! * Delete the specified tuples from source page. (In case we're in
! * Hot Standby, we need to hold lock on the page till we're done
! * inserting leaf tuples and the new inner tuple, else the added
! * redirect tuple will be a dangling link.)
! */
! if (record->xl_info & XLR_BKP_BLOCK(bbi))
! {
! srcBuffer = RestoreBackupBlock(lsn, record, bbi, false, true);
! srcPage = NULL; /* don't need to do any page updates */
! }
! else
{
srcBuffer = XLogReadBuffer(xldata->node, xldata->blknoSrc, false);
if (BufferIsValid(srcBuffer))
{
! srcPage = BufferGetPage(srcBuffer);
! if (!XLByteLE(lsn, PageGetLSN(srcPage)))
{
/*
* We have it a bit easier here than in doPickSplit(),
--- 577,590 ----
}
else
{
! /* delete the specified tuples from source page */
! if (!(record->xl_info & XLR_SET_BKP_BLOCK(bbi)))
{
srcBuffer = XLogReadBuffer(xldata->node, xldata->blknoSrc, false);
if (BufferIsValid(srcBuffer))
{
! page = BufferGetPage(srcBuffer);
! if (!XLByteLE(lsn, PageGetLSN(page)))
{
/*
* We have it a bit easier here than in doPickSplit(),
*************** spgRedoPickSplit(XLogRecPtr lsn, XLogRec
*** 678,691 ****
* we can inject the correct redirection tuple now.
*/
if (!state.isBuild)
! spgPageIndexMultiDelete(&state, srcPage,
toDelete, xldata->nDelete,
SPGIST_REDIRECT,
SPGIST_PLACEHOLDER,
xldata->blknoInner,
xldata->offnumInner);
else
! spgPageIndexMultiDelete(&state, srcPage,
toDelete, xldata->nDelete,
SPGIST_PLACEHOLDER,
SPGIST_PLACEHOLDER,
--- 592,605 ----
* we can inject the correct redirection tuple now.
*/
if (!state.isBuild)
! spgPageIndexMultiDelete(&state, page,
toDelete, xldata->nDelete,
SPGIST_REDIRECT,
SPGIST_PLACEHOLDER,
xldata->blknoInner,
xldata->offnumInner);
else
! spgPageIndexMultiDelete(&state, page,
toDelete, xldata->nDelete,
SPGIST_PLACEHOLDER,
SPGIST_PLACEHOLDER,
*************** spgRedoPickSplit(XLogRecPtr lsn, XLogRec
*** 694,705 ****
/* don't update LSN etc till we're done with it */
}
- else
- srcPage = NULL; /* don't do any page updates */
}
- else
- srcPage = NULL;
}
bbi++;
}
--- 608,617 ----
/* don't update LSN etc till we're done with it */
}
}
}
+ else
+ srcBuffer = InvalidBuffer;
bbi++;
}
*************** spgRedoPickSplit(XLogRecPtr lsn, XLogRec
*** 707,720 ****
if (xldata->blknoDest == InvalidBlockNumber)
{
destBuffer = InvalidBuffer;
- destPage = NULL;
}
else if (xldata->initDest)
{
/* just re-init the dest page */
destBuffer = XLogReadBuffer(xldata->node, xldata->blknoDest, true);
Assert(BufferIsValid(destBuffer));
! destPage = (Page) BufferGetPage(destBuffer);
SpGistInitBuffer(destBuffer,
SPGIST_LEAF | (xldata->storesNulls ? SPGIST_NULLS : 0));
--- 619,631 ----
if (xldata->blknoDest == InvalidBlockNumber)
{
destBuffer = InvalidBuffer;
}
else if (xldata->initDest)
{
/* just re-init the dest page */
destBuffer = XLogReadBuffer(xldata->node, xldata->blknoDest, true);
Assert(BufferIsValid(destBuffer));
! page = (Page) BufferGetPage(destBuffer);
SpGistInitBuffer(destBuffer,
SPGIST_LEAF | (xldata->storesNulls ? SPGIST_NULLS : 0));
*************** spgRedoPickSplit(XLogRecPtr lsn, XLogRec
*** 722,748 ****
}
else
{
! /*
! * We could probably release the page lock immediately in the
! * full-page-image case, but for safety let's hold it till later.
! */
! if (record->xl_info & XLR_BKP_BLOCK(bbi))
! {
! destBuffer = RestoreBackupBlock(lsn, record, bbi, false, true);
! destPage = NULL; /* don't need to do any page updates */
! }
! else
! {
destBuffer = XLogReadBuffer(xldata->node, xldata->blknoDest, false);
! if (BufferIsValid(destBuffer))
! {
! destPage = (Page) BufferGetPage(destBuffer);
! if (XLByteLE(lsn, PageGetLSN(destPage)))
! destPage = NULL; /* don't do any page updates */
! }
! else
! destPage = NULL;
! }
bbi++;
}
--- 633,642 ----
}
else
{
! if (!(record->xl_info & XLR_SET_BKP_BLOCK(bbi)))
destBuffer = XLogReadBuffer(xldata->node, xldata->blknoDest, false);
! else
! destBuffer = InvalidBuffer;
bbi++;
}
*************** spgRedoPickSplit(XLogRecPtr lsn, XLogRec
*** 750,783 ****
for (i = 0; i < xldata->nInsert; i++)
{
SpGistLeafTuple lt = (SpGistLeafTuple) ptr;
ptr += lt->size;
! page = leafPageSelect[i] ? destPage : srcPage;
! if (page == NULL)
continue; /* no need to touch this page */
! addOrReplaceTuple(page, (Item) lt, lt->size, toInsert[i]);
}
! /* Now update src and dest page LSNs if needed */
! if (srcPage != NULL)
{
! PageSetLSN(srcPage, lsn);
! PageSetTLI(srcPage, ThisTimeLineID);
! MarkBufferDirty(srcBuffer);
}
! if (destPage != NULL)
{
! PageSetLSN(destPage, lsn);
! PageSetTLI(destPage, ThisTimeLineID);
! MarkBufferDirty(destBuffer);
}
/* restore new inner tuple */
! if (record->xl_info & XLR_BKP_BLOCK(bbi))
! (void) RestoreBackupBlock(lsn, record, bbi, false, false);
! else
{
Buffer buffer = XLogReadBuffer(xldata->node, xldata->blknoInner,
xldata->initInner);
--- 644,690 ----
for (i = 0; i < xldata->nInsert; i++)
{
SpGistLeafTuple lt = (SpGistLeafTuple) ptr;
+ Buffer leafBuffer;
ptr += lt->size;
! leafBuffer = leafPageSelect[i] ? destBuffer : srcBuffer;
! if (!BufferIsValid(leafBuffer))
continue; /* no need to touch this page */
+ page = BufferGetPage(leafBuffer);
! if (!XLByteLE(lsn, PageGetLSN(page)))
! {
! addOrReplaceTuple(page, (Item) lt, lt->size, toInsert[i]);
! }
}
! /* Now update src and dest page LSNs */
! if (BufferIsValid(srcBuffer))
{
! page = BufferGetPage(srcBuffer);
! if (!XLByteLE(lsn, PageGetLSN(page)))
! {
! PageSetLSN(page, lsn);
! PageSetTLI(page, ThisTimeLineID);
! MarkBufferDirty(srcBuffer);
! }
! UnlockReleaseBuffer(srcBuffer);
}
! if (BufferIsValid(destBuffer))
{
! page = BufferGetPage(destBuffer);
! if (!XLByteLE(lsn, PageGetLSN(page)))
! {
! PageSetLSN(page, lsn);
! PageSetTLI(page, ThisTimeLineID);
! MarkBufferDirty(destBuffer);
! }
! UnlockReleaseBuffer(destBuffer);
}
/* restore new inner tuple */
! if (!(record->xl_info & XLR_SET_BKP_BLOCK(bbi)))
{
Buffer buffer = XLogReadBuffer(xldata->node, xldata->blknoInner,
xldata->initInner);
*************** spgRedoPickSplit(XLogRecPtr lsn, XLogRec
*** 815,829 ****
}
bbi++;
- /*
- * Now we can release the leaf-page locks. It's okay to do this before
- * updating the parent downlink.
- */
- if (BufferIsValid(srcBuffer))
- UnlockReleaseBuffer(srcBuffer);
- if (BufferIsValid(destBuffer))
- UnlockReleaseBuffer(destBuffer);
-
/* update parent downlink, unless we did it above */
if (xldata->blknoParent == InvalidBlockNumber)
{
--- 722,727 ----
*************** spgRedoPickSplit(XLogRecPtr lsn, XLogRec
*** 832,840 ****
}
else if (xldata->blknoInner != xldata->blknoParent)
{
! if (record->xl_info & XLR_BKP_BLOCK(bbi))
! (void) RestoreBackupBlock(lsn, record, bbi, false, false);
! else
{
Buffer buffer = XLogReadBuffer(xldata->node, xldata->blknoParent, false);
--- 730,736 ----
}
else if (xldata->blknoInner != xldata->blknoParent)
{
! if (!(record->xl_info & XLR_SET_BKP_BLOCK(bbi)))
{
Buffer buffer = XLogReadBuffer(xldata->node, xldata->blknoParent, false);
*************** spgRedoVacuumLeaf(XLogRecPtr lsn, XLogRe
*** 892,900 ****
ptr += sizeof(OffsetNumber) * xldata->nChain;
chainDest = (OffsetNumber *) ptr;
! if (record->xl_info & XLR_BKP_BLOCK(0))
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
! else
{
buffer = XLogReadBuffer(xldata->node, xldata->blkno, false);
if (BufferIsValid(buffer))
--- 788,794 ----
ptr += sizeof(OffsetNumber) * xldata->nChain;
chainDest = (OffsetNumber *) ptr;
! if (!(record->xl_info & XLR_BKP_BLOCK_1))
{
buffer = XLogReadBuffer(xldata->node, xldata->blkno, false);
if (BufferIsValid(buffer))
*************** spgRedoVacuumRoot(XLogRecPtr lsn, XLogRe
*** 963,971 ****
ptr += sizeof(spgxlogVacuumRoot);
toDelete = (OffsetNumber *) ptr;
! if (record->xl_info & XLR_BKP_BLOCK(0))
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
! else
{
buffer = XLogReadBuffer(xldata->node, xldata->blkno, false);
if (BufferIsValid(buffer))
--- 857,863 ----
ptr += sizeof(spgxlogVacuumRoot);
toDelete = (OffsetNumber *) ptr;
! if (!(record->xl_info & XLR_BKP_BLOCK_1))
{
buffer = XLogReadBuffer(xldata->node, xldata->blkno, false);
if (BufferIsValid(buffer))
*************** spgRedoVacuumRedirect(XLogRecPtr lsn, XL
*** 997,1016 ****
ptr += sizeof(spgxlogVacuumRedirect);
itemToPlaceholder = (OffsetNumber *) ptr;
! /*
! * If any redirection tuples are being removed, make sure there are no
! * live Hot Standby transactions that might need to see them.
! */
! if (InHotStandby)
! {
! if (TransactionIdIsValid(xldata->newestRedirectXid))
! ResolveRecoveryConflictWithSnapshot(xldata->newestRedirectXid,
! xldata->node);
! }
!
! if (record->xl_info & XLR_BKP_BLOCK(0))
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
! else
{
buffer = XLogReadBuffer(xldata->node, xldata->blkno, false);
--- 889,895 ----
ptr += sizeof(spgxlogVacuumRedirect);
itemToPlaceholder = (OffsetNumber *) ptr;
! if (!(record->xl_info & XLR_BKP_BLOCK_1))
{
buffer = XLogReadBuffer(xldata->node, xldata->blkno, false);
*************** spg_redo(XLogRecPtr lsn, XLogRecord *rec
*** 1075,1080 ****
--- 954,989 ----
uint8 info = record->xl_info & ~XLR_INFO_MASK;
MemoryContext oldCxt;
+ /*
+ * If we have any conflict processing to do, it must happen before we
+ * update the page.
+ */
+ if (InHotStandby)
+ {
+ switch (info)
+ {
+ case XLOG_SPGIST_VACUUM_REDIRECT:
+ {
+ spgxlogVacuumRedirect *xldata =
+ (spgxlogVacuumRedirect *) XLogRecGetData(record);
+
+ /*
+ * If any redirection tuples are being removed, make sure
+ * there are no live Hot Standby transactions that might
+ * need to see them.
+ */
+ if (TransactionIdIsValid(xldata->newestRedirectXid))
+ ResolveRecoveryConflictWithSnapshot(xldata->newestRedirectXid,
+ xldata->node);
+ break;
+ }
+ default:
+ break;
+ }
+ }
+
+ RestoreBkpBlocks(lsn, record, false);
+
oldCxt = MemoryContextSwitchTo(opCtx);
switch (info)
{
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
new file mode 100644
index f8ebf57..573c9ad
*** a/src/backend/access/transam/README
--- b/src/backend/access/transam/README
*************** critical section.)
*** 438,446 ****
4. Mark the shared buffer(s) as dirty with MarkBufferDirty(). (This must
happen before the WAL record is inserted; see notes in SyncOneBuffer().)
! 5. If the relation requires WAL-logging, build a WAL log record and pass it
! to XLogInsert(); then update the page's LSN and TLI using the returned XLOG
! location. For instance,
recptr = XLogInsert(rmgr_id, info, rdata);
--- 438,445 ----
4. Mark the shared buffer(s) as dirty with MarkBufferDirty(). (This must
happen before the WAL record is inserted; see notes in SyncOneBuffer().)
! 5. Build a WAL log record and pass it to XLogInsert(); then update the page's
! LSN and TLI using the returned XLOG location. For instance,
recptr = XLogInsert(rmgr_id, info, rdata);
*************** which buffers were handled that way ---
*** 467,475 ****
what the XLOG record actually contains. XLOG records that describe multi-page
changes therefore require some care to design: you must be certain that you
know what data is indicated by each "BKP" bit. An example of the trickiness
! is that in a HEAP_UPDATE record, BKP(0) normally is associated with the source
! page and BKP(1) is associated with the destination page --- but if these are
! the same page, only BKP(0) would have been set.
For this reason as well as the risk of deadlocking on buffer locks, it's best
to design WAL records so that they reflect small atomic actions involving just
--- 466,474 ----
what the XLOG record actually contains. XLOG records that describe multi-page
changes therefore require some care to design: you must be certain that you
know what data is indicated by each "BKP" bit. An example of the trickiness
! is that in a HEAP_UPDATE record, BKP(1) normally is associated with the source
! page and BKP(2) is associated with the destination page --- but if these are
! the same page, only BKP(1) would have been set.
For this reason as well as the risk of deadlocking on buffer locks, it's best
to design WAL records so that they reflect small atomic actions involving just
*************** incrementally update the page, the rdata
*** 498,516 ****
ID at least once; otherwise there is no defense against torn-page problems.
The standard replay-routine pattern for this case is
! if (record->xl_info & XLR_BKP_BLOCK(N))
! {
! /* apply the change from the full-page image */
! (void) RestoreBackupBlock(lsn, record, N, false, false);
! return;
! }
buffer = XLogReadBuffer(rnode, blkno, false);
if (!BufferIsValid(buffer))
! {
! /* page has been deleted, so we need do nothing */
! return;
! }
page = (Page) BufferGetPage(buffer);
if (XLByteLE(lsn, PageGetLSN(page)))
--- 497,508 ----
ID at least once; otherwise there is no defense against torn-page problems.
The standard replay-routine pattern for this case is
! if (record->xl_info & XLR_BKP_BLOCK_n)
! << do nothing, page was rewritten from logged copy >>;
buffer = XLogReadBuffer(rnode, blkno, false);
if (!BufferIsValid(buffer))
! << do nothing, page has been deleted >>;
page = (Page) BufferGetPage(buffer);
if (XLByteLE(lsn, PageGetLSN(page)))
*************** The standard replay-routine pattern for
*** 528,569 ****
UnlockReleaseBuffer(buffer);
As noted above, for a multi-page update you need to be able to determine
! which XLR_BKP_BLOCK(N) flag applies to each page. If a WAL record reflects
a combination of fully-rewritable and incremental updates, then the rewritable
! pages don't count for the XLR_BKP_BLOCK(N) numbering. (XLR_BKP_BLOCK(N) is
! associated with the N'th distinct buffer ID seen in the "rdata" array, and
per the above discussion, fully-rewritable buffers shouldn't be mentioned in
"rdata".)
- When replaying a WAL record that describes changes on multiple pages, you
- must be careful to lock the pages properly to prevent concurrent Hot Standby
- queries from seeing an inconsistent state. If this requires that two
- or more buffer locks be held concurrently, the coding pattern shown above
- is too simplistic, since it assumes the routine can exit as soon as it's
- known the current page requires no modification. Instead, you might have
- something like
-
- if (record->xl_info & XLR_BKP_BLOCK(0))
- {
- /* apply the change from the full-page image */
- buffer0 = RestoreBackupBlock(lsn, record, 0, false, true);
- }
- else
- {
- buffer0 = XLogReadBuffer(rnode, blkno, false);
- if (BufferIsValid(buffer0))
- {
- ... apply the change if not already done ...
- MarkBufferDirty(buffer0);
- }
- }
-
- ... similarly apply the changes for remaining pages ...
-
- /* and now we can release the lock on the first page */
- if (BufferIsValid(buffer0))
- UnlockReleaseBuffer(buffer0);
-
Due to all these constraints, complex changes (such as a multilevel index
insertion) normally need to be described by a series of atomic-action WAL
records. What do you do if the intermediate states are not self-consistent?
--- 520,532 ----
UnlockReleaseBuffer(buffer);
As noted above, for a multi-page update you need to be able to determine
! which XLR_BKP_BLOCK_n flag applies to each page. If a WAL record reflects
a combination of fully-rewritable and incremental updates, then the rewritable
! pages don't count for the XLR_BKP_BLOCK_n numbering. (XLR_BKP_BLOCK_n is
! associated with the n'th distinct buffer ID seen in the "rdata" array, and
per the above discussion, fully-rewritable buffers shouldn't be mentioned in
"rdata".)
Due to all these constraints, complex changes (such as a multilevel index
insertion) normally need to be described by a series of atomic-action WAL
records. What do you do if the intermediate states are not self-consistent?
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
new file mode 100644
index 1faf666..c541b5a
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** begin:;
*** 835,842 ****
* At the exit of this loop, write_len includes the backup block data.
*
* Also set the appropriate info bits to show which buffers were backed
! * up. The XLR_BKP_BLOCK(N) bit corresponds to the N'th distinct buffer
! * value (ignoring InvalidBuffer) appearing in the rdata chain.
*/
rdt_lastnormal = rdt;
write_len = len;
--- 835,842 ----
* At the exit of this loop, write_len includes the backup block data.
*
* Also set the appropriate info bits to show which buffers were backed
! * up. The i'th XLR_SET_BKP_BLOCK bit corresponds to the i'th distinct
! * buffer value (ignoring InvalidBuffer) appearing in the rdata chain.
*/
rdt_lastnormal = rdt;
write_len = len;
*************** begin:;
*** 848,854 ****
if (!dtbuf_bkp[i])
continue;
! info |= XLR_BKP_BLOCK(i);
bkpb = &(dtbuf_xlg[i]);
page = (char *) BufferGetBlock(dtbuf[i]);
--- 848,854 ----
if (!dtbuf_bkp[i])
continue;
! info |= XLR_SET_BKP_BLOCK(i);
bkpb = &(dtbuf_xlg[i]);
page = (char *) BufferGetBlock(dtbuf[i]);
*************** CleanupBackupHistory(void)
*** 3080,3095 ****
}
/*
! * Restore a full-page image from a backup block attached to an XLOG record.
! *
! * lsn: LSN of the XLOG record being replayed
! * record: the complete XLOG record
! * block_index: which backup block to restore (0 .. XLR_MAX_BKP_BLOCKS - 1)
! * get_cleanup_lock: TRUE to get a cleanup rather than plain exclusive lock
! * keep_buffer: TRUE to return the buffer still locked and pinned
*
! * Returns the buffer number containing the page. Note this is not terribly
! * useful unless keep_buffer is specified as TRUE.
*
* Note: when a backup block is available in XLOG, we restore it
* unconditionally, even if the page in the database appears newer.
--- 3080,3088 ----
}
/*
! * Restore the backup blocks present in an XLOG record, if any.
*
! * We assume all of the record has been read into memory at *record.
*
* Note: when a backup block is available in XLOG, we restore it
* unconditionally, even if the page in the database appears newer.
*************** CleanupBackupHistory(void)
*** 3100,3119 ****
* modifications of the page that appear in XLOG, rather than possibly
* ignoring them as already applied, but that's not a huge drawback.
*
! * If 'get_cleanup_lock' is true, a cleanup lock is obtained on the buffer,
! * else a normal exclusive lock is used. During crash recovery, that's just
! * pro forma because there can't be any regular backends in the system, but
! * in hot standby mode the distinction is important.
! *
! * If 'keep_buffer' is true, return without releasing the buffer lock and pin;
! * then caller is responsible for doing UnlockReleaseBuffer() later. This
! * is needed in some cases when replaying XLOG records that touch multiple
! * pages, to prevent inconsistent states from being visible to other backends.
! * (Again, that's only important in hot standby mode.)
*/
! Buffer
! RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, int block_index,
! bool get_cleanup_lock, bool keep_buffer)
{
Buffer buffer;
Page page;
--- 3093,3107 ----
* modifications of the page that appear in XLOG, rather than possibly
* ignoring them as already applied, but that's not a huge drawback.
*
! * If 'cleanup' is true, a cleanup lock is used when restoring blocks.
! * Otherwise, a normal exclusive lock is used. During crash recovery, that's
! * just pro forma because there can't be any regular backends in the system,
! * but in hot standby mode the distinction is important. The 'cleanup'
! * argument applies to all backup blocks in the WAL record, that suffices for
! * now.
*/
! void
! RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
{
Buffer buffer;
Page page;
*************** RestoreBackupBlock(XLogRecPtr lsn, XLogR
*** 3121,3179 ****
char *blk;
int i;
! /* Locate requested BkpBlock in the record */
blk = (char *) XLogRecGetData(record) + record->xl_len;
for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
{
! if (!(record->xl_info & XLR_BKP_BLOCK(i)))
continue;
memcpy(&bkpb, blk, sizeof(BkpBlock));
blk += sizeof(BkpBlock);
! if (i == block_index)
! {
! /* Found it, apply the update */
! buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block,
! RBM_ZERO);
! Assert(BufferIsValid(buffer));
! if (get_cleanup_lock)
! LockBufferForCleanup(buffer);
! else
! LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
!
! page = (Page) BufferGetPage(buffer);
!
! if (bkpb.hole_length == 0)
! {
! memcpy((char *) page, blk, BLCKSZ);
! }
! else
! {
! memcpy((char *) page, blk, bkpb.hole_offset);
! /* must zero-fill the hole */
! MemSet((char *) page + bkpb.hole_offset, 0, bkpb.hole_length);
! memcpy((char *) page + (bkpb.hole_offset + bkpb.hole_length),
! blk + bkpb.hole_offset,
! BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
! }
!
! PageSetLSN(page, lsn);
! PageSetTLI(page, ThisTimeLineID);
! MarkBufferDirty(buffer);
! if (!keep_buffer)
! UnlockReleaseBuffer(buffer);
! return buffer;
}
blk += BLCKSZ - bkpb.hole_length;
}
-
- /* Caller specified a bogus block_index */
- elog(ERROR, "failed to restore block_index %d", block_index);
- return InvalidBuffer; /* keep compiler quiet */
}
/*
--- 3109,3157 ----
char *blk;
int i;
! if (!(record->xl_info & XLR_BKP_BLOCK_MASK))
! return;
!
blk = (char *) XLogRecGetData(record) + record->xl_len;
for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
{
! if (!(record->xl_info & XLR_SET_BKP_BLOCK(i)))
continue;
memcpy(&bkpb, blk, sizeof(BkpBlock));
blk += sizeof(BkpBlock);
! buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block,
! RBM_ZERO);
! Assert(BufferIsValid(buffer));
! if (cleanup)
! LockBufferForCleanup(buffer);
! else
! LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
! page = (Page) BufferGetPage(buffer);
! if (bkpb.hole_length == 0)
! {
! memcpy((char *) page, blk, BLCKSZ);
! }
! else
! {
! memcpy((char *) page, blk, bkpb.hole_offset);
! /* must zero-fill the hole */
! MemSet((char *) page + bkpb.hole_offset, 0, bkpb.hole_length);
! memcpy((char *) page + (bkpb.hole_offset + bkpb.hole_length),
! blk + bkpb.hole_offset,
! BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
}
+ PageSetLSN(page, lsn);
+ PageSetTLI(page, ThisTimeLineID);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+
blk += BLCKSZ - bkpb.hole_length;
}
}
/*
*************** RecordIsValid(XLogRecord *record, XLogRe
*** 3215,3221 ****
{
uint32 blen;
! if (!(record->xl_info & XLR_BKP_BLOCK(i)))
continue;
if (remaining < sizeof(BkpBlock))
--- 3193,3199 ----
{
uint32 blen;
! if (!(record->xl_info & XLR_SET_BKP_BLOCK(i)))
continue;
if (remaining < sizeof(BkpBlock))
*************** xlog_outrec(StringInfo buf, XLogRecord *
*** 8103,8110 ****
int i;
appendStringInfo(buf, "prev %X/%X; xid %u",
! (uint32) (record->xl_prev >> 32),
! (uint32) record->xl_prev,
record->xl_xid);
appendStringInfo(buf, "; len %u",
--- 8081,8087 ----
int i;
appendStringInfo(buf, "prev %X/%X; xid %u",
! (uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
record->xl_xid);
appendStringInfo(buf, "; len %u",
*************** xlog_outrec(StringInfo buf, XLogRecord *
*** 8112,8119 ****
for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
{
! if (record->xl_info & XLR_BKP_BLOCK(i))
! appendStringInfo(buf, "; bkpb%d", i);
}
appendStringInfo(buf, ": %s", RmgrTable[record->xl_rmid].rm_name);
--- 8089,8096 ----
for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
{
! if (record->xl_info & XLR_SET_BKP_BLOCK(i))
! appendStringInfo(buf, "; bkpb%d", i + 1);
}
appendStringInfo(buf, ": %s", RmgrTable[record->xl_rmid].rm_name);
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
new file mode 100644
index 1e8eabd..52877ae
*** a/src/include/access/gist_private.h
--- b/src/include/access/gist_private.h
*************** typedef GISTScanOpaqueData *GISTScanOpaq
*** 167,173 ****
#define XLOG_GIST_PAGE_SPLIT 0x30
/* #define XLOG_GIST_INSERT_COMPLETE 0x40 */ /* not used anymore */
#define XLOG_GIST_CREATE_INDEX 0x50
! /* #define XLOG_GIST_PAGE_DELETE 0x60 */ /* not used anymore */
typedef struct gistxlogPageUpdate
{
--- 167,173 ----
#define XLOG_GIST_PAGE_SPLIT 0x30
/* #define XLOG_GIST_INSERT_COMPLETE 0x40 */ /* not used anymore */
#define XLOG_GIST_CREATE_INDEX 0x50
! #define XLOG_GIST_PAGE_DELETE 0x60
typedef struct gistxlogPageUpdate
{
*************** typedef struct gistxlogPage
*** 211,216 ****
--- 211,222 ----
int num; /* number of index tuples following */
} gistxlogPage;
+ typedef struct gistxlogPageDelete
+ {
+ RelFileNode node;
+ BlockNumber blkno;
+ } gistxlogPageDelete;
+
/* SplitedPageLayout - gistSplit function result */
typedef struct SplitedPageLayout
{
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
new file mode 100644
index 32c2e40..2893f3b
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
*************** typedef struct XLogRecord
*** 71,77 ****
*/
#define XLR_BKP_BLOCK_MASK 0x0F /* all info bits used for bkp blocks */
#define XLR_MAX_BKP_BLOCKS 4
! #define XLR_BKP_BLOCK(iblk) (0x08 >> (iblk)) /* iblk in 0..3 */
/* Sync methods */
#define SYNC_METHOD_FSYNC 0
--- 71,81 ----
*/
#define XLR_BKP_BLOCK_MASK 0x0F /* all info bits used for bkp blocks */
#define XLR_MAX_BKP_BLOCKS 4
! #define XLR_SET_BKP_BLOCK(iblk) (0x08 >> (iblk))
! #define XLR_BKP_BLOCK_1 XLR_SET_BKP_BLOCK(0) /* 0x08 */
! #define XLR_BKP_BLOCK_2 XLR_SET_BKP_BLOCK(1) /* 0x04 */
! #define XLR_BKP_BLOCK_3 XLR_SET_BKP_BLOCK(2) /* 0x02 */
! #define XLR_BKP_BLOCK_4 XLR_SET_BKP_BLOCK(3) /* 0x01 */
/* Sync methods */
#define SYNC_METHOD_FSYNC 0
*************** extern int sync_method;
*** 90,102 ****
* If buffer is valid then XLOG will check if buffer must be backed up
* (ie, whether this is first change of that page since last checkpoint).
* If so, the whole page contents are attached to the XLOG record, and XLOG
! * sets XLR_BKP_BLOCK(N) bit in xl_info. Note that the buffer must be pinned
* and exclusive-locked by the caller, so that it won't change under us.
* NB: when the buffer is backed up, we DO NOT insert the data pointed to by
* this XLogRecData struct into the XLOG record, since we assume it's present
* in the buffer. Therefore, rmgr redo routines MUST pay attention to
! * XLR_BKP_BLOCK(N) to know what is actually stored in the XLOG record.
! * The N'th XLR_BKP_BLOCK bit corresponds to the N'th distinct buffer
* value (ignoring InvalidBuffer) appearing in the rdata chain.
*
* When buffer is valid, caller must set buffer_std to indicate whether the
--- 94,106 ----
* If buffer is valid then XLOG will check if buffer must be backed up
* (ie, whether this is first change of that page since last checkpoint).
* If so, the whole page contents are attached to the XLOG record, and XLOG
! * sets XLR_BKP_BLOCK_X bit in xl_info. Note that the buffer must be pinned
* and exclusive-locked by the caller, so that it won't change under us.
* NB: when the buffer is backed up, we DO NOT insert the data pointed to by
* this XLogRecData struct into the XLOG record, since we assume it's present
* in the buffer. Therefore, rmgr redo routines MUST pay attention to
! * XLR_BKP_BLOCK_X to know what is actually stored in the XLOG record.
! * The i'th XLR_BKP_BLOCK bit corresponds to the i'th distinct buffer
* value (ignoring InvalidBuffer) appearing in the rdata chain.
*
* When buffer is valid, caller must set buffer_std to indicate whether the
*************** extern int XLogFileOpen(XLogSegNo segno)
*** 270,278 ****
extern void XLogGetLastRemoved(XLogSegNo *segno);
extern void XLogSetAsyncXactLSN(XLogRecPtr record);
! extern Buffer RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record,
! int block_index,
! bool get_cleanup_lock, bool keep_buffer);
extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
--- 274,280 ----
extern void XLogGetLastRemoved(XLogSegNo *segno);
extern void XLogSetAsyncXactLSN(XLogRecPtr record);
! extern void RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup);
extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
On Tue, Nov 13, 2012 at 07:03:51PM -0500, Bruce Momjian wrote:
I am attaching an updated pg_upgrade patch, which I believe is ready for
application for 9.3.
Correction, here is the proper patch. The previously posted version
had pending merges from the master branch.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Attachments:
pg_upgrade.diff (text/x-diff; charset=us-ascii)
diff --git a/contrib/pg_upgrade/file.c b/contrib/pg_upgrade/file.c
new file mode 100644
index a5d92c6..d8cd8f5
*** a/contrib/pg_upgrade/file.c
--- b/contrib/pg_upgrade/file.c
*************** copy_file(const char *srcfile, const cha
*** 221,281 ****
#endif
- /*
- * load_directory()
- *
- * Read all the file names in the specified directory, and return them as
- * an array of "char *" pointers. The array address is returned in
- * *namelist, and the function result is the count of file names.
- *
- * To free the result data, free each (char *) array member, then free the
- * namelist array itself.
- */
- int
- load_directory(const char *dirname, char ***namelist)
- {
- DIR *dirdesc;
- struct dirent *direntry;
- int count = 0;
- int allocsize = 64; /* initial array size */
-
- *namelist = (char **) pg_malloc(allocsize * sizeof(char *));
-
- if ((dirdesc = opendir(dirname)) == NULL)
- pg_log(PG_FATAL, "could not open directory \"%s\": %s\n",
- dirname, getErrorText(errno));
-
- while (errno = 0, (direntry = readdir(dirdesc)) != NULL)
- {
- if (count >= allocsize)
- {
- allocsize *= 2;
- *namelist = (char **)
- pg_realloc(*namelist, allocsize * sizeof(char *));
- }
-
- (*namelist)[count++] = pg_strdup(direntry->d_name);
- }
-
- #ifdef WIN32
- /*
- * This fix is in mingw cvs (runtime/mingwex/dirent.c rev 1.4), but not in
- * released version
- */
- if (GetLastError() == ERROR_NO_MORE_FILES)
- errno = 0;
- #endif
-
- if (errno)
- pg_log(PG_FATAL, "could not read directory \"%s\": %s\n",
- dirname, getErrorText(errno));
-
- closedir(dirdesc);
-
- return count;
- }
-
-
void
check_hard_link(void)
{
--- 221,226 ----
diff --git a/contrib/pg_upgrade/pg_upgrade.h b/contrib/pg_upgrade/pg_upgrade.h
new file mode 100644
index 3058343..f35ce75
*** a/contrib/pg_upgrade/pg_upgrade.h
--- b/contrib/pg_upgrade/pg_upgrade.h
***************
*** 7,13 ****
#include <unistd.h>
#include <assert.h>
- #include <dirent.h>
#include <sys/stat.h>
#include <sys/time.h>
--- 7,12 ----
*************** const char *setupPageConverter(pageCnvCt
*** 366,372 ****
typedef void *pageCnvCtx;
#endif
- int load_directory(const char *dirname, char ***namelist);
const char *copyAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
const char *dst, bool force);
const char *linkAndUpdateFile(pageCnvCtx *pageConverter, const char *src,
--- 365,370 ----
diff --git a/contrib/pg_upgrade/relfilenode.c b/contrib/pg_upgrade/relfilenode.c
new file mode 100644
index 33a867f..d763ba7
*** a/contrib/pg_upgrade/relfilenode.c
--- b/contrib/pg_upgrade/relfilenode.c
***************
*** 17,25 ****
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size);
! static void transfer_relfile(pageCnvCtx *pageConverter,
! const char *fromfile, const char *tofile,
! const char *nspname, const char *relname);
/*
--- 17,24 ----
static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size);
! static void transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
! const char *suffix);
/*
*************** static void
*** 131,185 ****
transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size)
{
- char old_dir[MAXPGPATH];
- char file_pattern[MAXPGPATH];
- char **namelist = NULL;
- int numFiles = 0;
int mapnum;
! int fileno;
! bool vm_crashsafe_change = false;
!
! old_dir[0] = '\0';
!
! /* Do not copy non-crashsafe vm files for binaries that assume crashsafety */
if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_CRASHSAFE_CAT_VER &&
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
! vm_crashsafe_change = true;
for (mapnum = 0; mapnum < size; mapnum++)
{
! char old_file[MAXPGPATH];
! char new_file[MAXPGPATH];
!
! /* Changed tablespaces? Need a new directory scan? */
! if (strcmp(maps[mapnum].old_dir, old_dir) != 0)
! {
! if (numFiles > 0)
! {
! for (fileno = 0; fileno < numFiles; fileno++)
! pg_free(namelist[fileno]);
! pg_free(namelist);
! }
!
! snprintf(old_dir, sizeof(old_dir), "%s", maps[mapnum].old_dir);
! numFiles = load_directory(old_dir, &namelist);
! }
!
! /* Copying files might take some time, so give feedback. */
!
! snprintf(old_file, sizeof(old_file), "%s/%u", maps[mapnum].old_dir,
! maps[mapnum].old_relfilenode);
! snprintf(new_file, sizeof(new_file), "%s/%u", maps[mapnum].new_dir,
! maps[mapnum].new_relfilenode);
! pg_log(PG_REPORT, OVERWRITE_MESSAGE, old_file);
!
! /*
! * Copy/link the relation's primary file (segment 0 of main fork)
! * to the new cluster
! */
! unlink(new_file);
! transfer_relfile(pageConverter, old_file, new_file,
! maps[mapnum].nspname, maps[mapnum].relname);
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
--- 130,150 ----
transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size)
{
int mapnum;
! bool vm_crashsafe_match = true;
!
! /*
! * Do the old and new cluster disagree on the crash-safetiness of the vm
! * files? If so, do not copy them.
! */
if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_CRASHSAFE_CAT_VER &&
new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
! vm_crashsafe_match = false;
for (mapnum = 0; mapnum < size; mapnum++)
{
! /* transfer primary file */
! transfer_relfile(pageConverter, &maps[mapnum], "");
/* fsm/vm files added in PG 8.4 */
if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
*************** transfer_single_new_db(pageCnvCtx *pageC
*** 187,253 ****
/*
* Copy/link any fsm and vm files, if they exist
*/
! snprintf(file_pattern, sizeof(file_pattern), "%u_",
! maps[mapnum].old_relfilenode);
!
! for (fileno = 0; fileno < numFiles; fileno++)
! {
! char *vm_offset = strstr(namelist[fileno], "_vm");
! bool is_vm_file = false;
!
! /* Is a visibility map file? (name ends with _vm) */
! if (vm_offset && strlen(vm_offset) == strlen("_vm"))
! is_vm_file = true;
!
! if (strncmp(namelist[fileno], file_pattern,
! strlen(file_pattern)) == 0 &&
! (!is_vm_file || !vm_crashsafe_change))
! {
! snprintf(old_file, sizeof(old_file), "%s/%s", maps[mapnum].old_dir,
! namelist[fileno]);
! snprintf(new_file, sizeof(new_file), "%s/%u%s", maps[mapnum].new_dir,
! maps[mapnum].new_relfilenode, strchr(namelist[fileno], '_'));
!
! unlink(new_file);
! transfer_relfile(pageConverter, old_file, new_file,
! maps[mapnum].nspname, maps[mapnum].relname);
! }
! }
! }
!
! /*
! * Now copy/link any related segments as well. Remember, PG breaks
! * large files into 1GB segments, the first segment has no extension,
! * subsequent segments are named relfilenode.1, relfilenode.2,
! * relfilenode.3, ... 'fsm' and 'vm' files use underscores so are not
! * copied.
! */
! snprintf(file_pattern, sizeof(file_pattern), "%u.",
! maps[mapnum].old_relfilenode);
!
! for (fileno = 0; fileno < numFiles; fileno++)
! {
! if (strncmp(namelist[fileno], file_pattern,
! strlen(file_pattern)) == 0)
! {
! snprintf(old_file, sizeof(old_file), "%s/%s", maps[mapnum].old_dir,
! namelist[fileno]);
! snprintf(new_file, sizeof(new_file), "%s/%u%s", maps[mapnum].new_dir,
! maps[mapnum].new_relfilenode, strchr(namelist[fileno], '.'));
!
! unlink(new_file);
! transfer_relfile(pageConverter, old_file, new_file,
! maps[mapnum].nspname, maps[mapnum].relname);
! }
}
}
-
- if (numFiles > 0)
- {
- for (fileno = 0; fileno < numFiles; fileno++)
- pg_free(namelist[fileno]);
- pg_free(namelist);
- }
}
--- 152,162 ----
/*
* Copy/link any fsm and vm files, if they exist
*/
! transfer_relfile(pageConverter, &maps[mapnum], "_fsm");
! if (vm_crashsafe_match)
! transfer_relfile(pageConverter, &maps[mapnum], "_vm");
}
}
}
*************** transfer_single_new_db(pageCnvCtx *pageC
*** 257,287 ****
* Copy or link file from old cluster to new one.
*/
static void
! transfer_relfile(pageCnvCtx *pageConverter, const char *old_file,
! const char *new_file, const char *nspname, const char *relname)
{
const char *msg;
!
! if ((user_opts.transfer_mode == TRANSFER_MODE_LINK) && (pageConverter != NULL))
! pg_log(PG_FATAL, "This upgrade requires page-by-page conversion, "
! "you must use copy mode instead of link mode.\n");
!
! if (user_opts.transfer_mode == TRANSFER_MODE_COPY)
{
! pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
! if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
! pg_log(PG_FATAL, "error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
! nspname, relname, old_file, new_file, msg);
! }
! else
! {
! pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
- if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
- pg_log(PG_FATAL,
- "error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
- nspname, relname, old_file, new_file, msg);
- }
return;
}
--- 166,243 ----
* Copy or link file from old cluster to new one.
*/
static void
! transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
! const char *type_suffix)
{
const char *msg;
! char old_file[MAXPGPATH];
! char new_file[MAXPGPATH];
! int fd;
! int segno;
! char extent_suffix[65];
!
! /*
! * Now copy/link any related segments as well. Remember, PG breaks
! * large files into 1GB segments, the first segment has no extension,
! * subsequent segments are named relfilenode.1, relfilenode.2,
! * relfilenode.3.
! */
! for (segno = 0;; segno++)
{
! if (segno == 0)
! extent_suffix[0] = '\0';
! else
! snprintf(extent_suffix, sizeof(extent_suffix), ".%d", segno);
! snprintf(old_file, sizeof(old_file), "%s/%u%s%s", map->old_dir,
! map->old_relfilenode, type_suffix, extent_suffix);
! snprintf(new_file, sizeof(new_file), "%s/%u%s%s", map->new_dir,
! map->new_relfilenode, type_suffix, extent_suffix);
!
! /* Is it an extent, fsm, or vm file? */
! if (type_suffix[0] != '\0' || segno != 0)
! {
! /* Did file open fail? */
! if ((fd = open(old_file, O_RDONLY)) == -1)
! {
! /* File does not exist? That's OK, just return */
! if (errno == ENOENT)
! return;
! else
! pg_log(PG_FATAL, "non-existent file error while copying relation \"%s.%s\" (\"%s\" to \"%s\")\n",
! map->nspname, map->relname, old_file, new_file);
! }
! close(fd);
! }
!
! unlink(new_file);
!
! /* Copying files might take some time, so give feedback. */
! pg_log(PG_REPORT, OVERWRITE_MESSAGE, old_file);
!
! if ((user_opts.transfer_mode == TRANSFER_MODE_LINK) && (pageConverter != NULL))
! pg_log(PG_FATAL, "This upgrade requires page-by-page conversion, "
! "you must use copy mode instead of link mode.\n");
!
! if (user_opts.transfer_mode == TRANSFER_MODE_COPY)
! {
! pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", old_file, new_file);
!
! if ((msg = copyAndUpdateFile(pageConverter, old_file, new_file, true)) != NULL)
! pg_log(PG_FATAL, "error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
! map->nspname, map->relname, old_file, new_file, msg);
! }
! else
! {
! pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n", old_file, new_file);
!
! if ((msg = linkAndUpdateFile(pageConverter, old_file, new_file)) != NULL)
! pg_log(PG_FATAL,
! "error while creating link for relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
! map->nspname, map->relname, old_file, new_file, msg);
! }
! }
return;
}
On Wed, Nov 14, 2012 at 2:03 AM, Bruce Momjian <bruce@momjian.us> wrote:
At 64k I see pg_upgrade taking 12% of the duration time, if I subtract
out the dump/restore times.
My percentage numbers only included CPU time and I used SSD storage.
For the most part there was no IO wait to speak of, but it's
completely expected that thousands of link calls are not free.
Postgres time itself breaks down with 10% for shutdown checkpoint and
90% for regular running, consisting of 16% parsing, 13% analyze, 20%
plan, 30% execute, 11% commit (AtEOXact_RelationCache) and 6% network.

That SVG graph was quite impressive.
I used perf and Gprof2Dot for this. I will probably do a blog post on
how to generate these graphs. It's much more useful for me than a
plain flat profile as I don't know by heart which functions are called
by which.
It looks to me that most benefit could be had from introducing more
parallelism. Are there any large roadblocks to pipelining the dump and
restore to have them happen in parallel?

I talked to Andrew Dunstan about parallelization in pg_restore. First,
we currently use pg_dumpall, which isn't in the custom format required
for parallel restore, but if we changed to custom format, create table
isn't done in parallel, only create index/check constraints, and trigger
creation, etc. Not sure if it is worth pursuing this just for pg_upgrade.
I agree that parallel restore for schemas is a hard problem. But I
didn't mean parallelism within the restore, I meant that we could
start both postmasters and pipe the output from dump directly to
restore. This way the times for dumping and restoring can overlap.
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On Wed, Nov 14, 2012 at 06:11:27AM +0200, Ants Aasma wrote:
On Wed, Nov 14, 2012 at 2:03 AM, Bruce Momjian <bruce@momjian.us> wrote:
At 64k I see pg_upgrade taking 12% of the duration time, if I subtract
out the dump/restore times.

My percentage numbers only included CPU time and I used SSD storage.
For the most part there was no IO wait to speak of, but it's
completely expected that thousands of link calls are not free.
Agreed. I was looking at wall clock time so I could see the total
impact of everything pg_upgrade does.
Postgres time itself breaks down with 10% for shutdown checkpoint and
90% for regular running, consisting of 16% parsing, 13% analyze, 20%
plan, 30% execute, 11% commit (AtEOXact_RelationCache) and 6% network.

That SVG graph was quite impressive.
I used perf and Gprof2Dot for this. I will probably do a blog post on
how to generate these graphs. It's much more useful for me than a
plain flat profile as I don't know by heart which functions are called
by which.
Yes, please share that information.
It looks to me that most benefit could be had from introducing more
parallelism. Are there any large roadblocks to pipelining the dump and
restore to have them happen in parallel?

I talked to Andrew Dunstan about parallelization in pg_restore. First,
we currently use pg_dumpall, which isn't in the custom format required
for parallel restore, but if we changed to custom format, create table
isn't done in parallel, only create index/check constraints, and trigger
creation, etc. Not sure if it is worth pursuing this just for pg_upgrade.

I agree that parallel restore for schemas is a hard problem. But I
didn't mean parallelism within the restore, I meant that we could
start both postmasters and pipe the output from dump directly to
restore. This way the times for dumping and restoring can overlap.
Wow, that is a very creative idea. The current code doesn't do that,
but this has the potential of doubling pg_upgrade's speed, without
adding a lot of complexity. Here are the challenges of this approach:
* I would need to log the output of pg_dumpall as it is passed to psql
so users can debug problems
* pg_upgrade never runs the old and new clusters at the same time for
fear that it will run out of resources, e.g. shared memory, or if they
are using the same port number. We can make this optional and force
different port numbers.
Let me work up a prototype in the next few days and see how it performs.
Thanks for the great idea.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On 11/14/2012 10:08 AM, Bruce Momjian wrote:
On Wed, Nov 14, 2012 at 06:11:27AM +0200, Ants Aasma wrote:
I agree that parallel restore for schemas is a hard problem. But I
didn't mean parallelism within the restore, I meant that we could
start both postmasters and pipe the output from dump directly to
restore. This way the times for dumping and restoring can overlap.

Wow, that is a very creative idea. The current code doesn't do that,
but this has the potential of doubling pg_upgrade's speed, without
adding a lot of complexity. Here are the challenges of this approach:

* I would need to log the output of pg_dumpall as it is passed to psql
so users can debug problems
Instead of piping it directly, have pg_upgrade work as a tee, pumping
bytes both to psql and a file. This doesn't seem terribly hard.
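A minimal sketch of that tee-style pump (the function name, buffer size,
and stream arguments are illustrative only, not pg_upgrade code; in the
real thing the streams would be pipes to pg_dumpall and psql):

```c
#include <stdio.h>

/*
 * Illustrative tee: copy bytes from the dump stream to both the restore
 * stream and a log file, so the schema SQL is preserved for debugging.
 * Returns the number of bytes pumped.  Error handling is elided; the
 * point is only that logging costs one extra fwrite per buffer.
 */
static long
pump_with_log(FILE *dump, FILE *restore, FILE *logfile)
{
	char	buf[8192];
	size_t	n;
	long	total = 0;

	while ((n = fread(buf, 1, sizeof(buf), dump)) > 0)
	{
		fwrite(buf, 1, n, restore);		/* feed the new cluster's psql */
		fwrite(buf, 1, n, logfile);		/* keep a copy for debugging */
		total += (long) n;
	}
	return total;
}
```

Since the pump just forwards whatever pg_dumpall emits, the dump and
restore phases can overlap without either side knowing about the log.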
* pg_upgrade never runs the old and new clusters at the same time for
fear that it will run out of resources, e.g. shared memory, or if they
are using the same port number. We can make this optional and force
different port numbers.
Right.
cheers
andrew
On Wed, Nov 14, 2012 at 10:25:24AM -0500, Andrew Dunstan wrote:
On 11/14/2012 10:08 AM, Bruce Momjian wrote:
On Wed, Nov 14, 2012 at 06:11:27AM +0200, Ants Aasma wrote:
I agree that parallel restore for schemas is a hard problem. But I
didn't mean parallelism within the restore, I meant that we could
start both postmasters and pipe the output from dump directly to
restore. This way the times for dumping and restoring can overlap.

Wow, that is a very creative idea. The current code doesn't do that,
but this has the potential of doubling pg_upgrade's speed, without
adding a lot of complexity. Here are the challenges of this approach:

* I would need to log the output of pg_dumpall as it is passed to psql
so users can debug problems

Instead of piping it directly, have pg_upgrade work as a tee,
pumping bytes both to psql and a file. This doesn't seem terribly
hard.
Right. It isn't hard.
* pg_upgrade never runs the old and new clusters at the same time for
fear that it will run out of resources, e.g. shared memory, or if they
are using the same port number. We can make this optional and force
different port numbers.

Right.
OK.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
Are you sure the server you are dumping out of is head?
I experimented a bit with dumping/restoring 16000 tables matching
Bruce's test case (ie, one serial column apiece). The pg_dump profile
seems fairly flat, without any easy optimization targets. But
restoring the dump script shows a rather interesting backend profile:

samples   %        image name   symbol name
30861 39.6289 postgres AtEOXact_RelationCache
9911 12.7268 postgres hash_seq_search
...
The hash_seq_search time is probably mostly associated with
AtEOXact_RelationCache, which is run during transaction commit and scans
the relcache hashtable looking for tables created in the current
transaction. So that's about 50% of the runtime going into that one
activity.

There are at least three ways we could whack that mole:
* Run the psql script in --single-transaction mode, as I was mumbling
about the other day. If we were doing AtEOXact_RelationCache only once,
rather than once per CREATE TABLE statement, it wouldn't be a problem.
Easy but has only a narrow scope of applicability.

* Keep a separate list (or data structure of your choice) so that
relcache entries created in the current xact could be found directly
rather than having to scan the whole relcache. That'd add complexity
though, and could perhaps be a net loss for cases where the relcache
isn't so bloated.
Maybe a static list that can overflow, like the ResourceOwner/Lock
table one recently added. The overhead of that should be very low.
Are the three places where "need_eoxact_work = true;" the only places
where things need to be added to the new structure? It seems like
there is no need to remove things from the list, because the things
done in AtEOXact_RelationCache are idempotent.
* Limit the size of the relcache (eg by aging out
not-recently-referenced entries) so that we aren't incurring O(N^2)
costs for scripts touching N tables. Again, this adds complexity and
could be counterproductive in some scenarios.
I made the crude hack of just dumping the relcache whenever it was
over 1000 at eox. The time to load 100,000 tables went from 62 minutes
without the patch to 12 minutes with it. (loading with "-1 -f" took
23 minutes).
The next quadratic behavior is in init_sequence.
Cheers,
Jeff
diff --git a/src/backend/utils/cache/relcache.c
b/src/backend/utils/cache/relcache.c
index 8c9ebe0..3941c98 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2260,6 +2260,8 @@ AtEOXact_RelationCache(bool isCommit)
)
return;
+ if (hash_get_num_entries(RelationIdCache) > 1000)
+ RelationCacheInvalidate();
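The crude hack above simply flushes the whole relcache past a threshold; the second option Tom lists (a separate list of possibly-dirty entries) can be sketched in miniature as standalone C. All names here are invented for illustration, not backend code; it only shows how a small overflow-able list keeps commit-time cleanup proportional to the work actually done, falling back to a full scan on overflow:

```c
#include <assert.h>
#include <stdbool.h>

#define CACHE_SIZE 16384          /* stand-in for the relcache hash table */
#define MAX_EOXACT_LIST 10        /* small per-transaction tracking list */

static bool needs_work[CACHE_SIZE];           /* entries touched this xact */
static unsigned eoxact_list[MAX_EOXACT_LIST];
static int n_eoxact_list = 0;
static bool eoxact_overflowed = false;

/* Called wherever the backend would have set need_eoxact_work = true. */
static void
note_needs_eoxact_work(unsigned id)
{
    needs_work[id] = true;
    if (n_eoxact_list < MAX_EOXACT_LIST)
        eoxact_list[n_eoxact_list++] = id;   /* duplicates are harmless */
    else
        eoxact_overflowed = true;            /* fall back to a full scan */
}

/* Commit-time cleanup; returns the number of entries visited. */
static int
at_eoxact_cleanup(void)
{
    int visited = 0;

    if (!eoxact_overflowed)
    {
        /* Visit only the (few) entries we know might need work. */
        for (int i = 0; i < n_eoxact_list; i++)
        {
            needs_work[eoxact_list[i]] = false;   /* idempotent */
            visited++;
        }
    }
    else
    {
        /* Overflowed: scan everything, as the flag-only code always did. */
        for (unsigned id = 0; id < CACHE_SIZE; id++)
        {
            needs_work[id] = false;
            visited++;
        }
    }
    n_eoxact_list = 0;
    eoxact_overflowed = false;
    return visited;
}
```

With three entries touched, cleanup visits 3 entries instead of all 16384; a transaction that touches more than MAX_EOXACT_LIST degrades gracefully to the old full scan.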
Jeff Janes <jeff.janes@gmail.com> writes:
On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
There are at least three ways we could whack that mole: ...
* Keep a separate list (or data structure of your choice) so that
relcache entries created in the current xact could be found directly
rather than having to scan the whole relcache. That'd add complexity
though, and could perhaps be a net loss for cases where the relcache
isn't so bloated.
Maybe a static list that can overflow, like the ResourceOwner/Lock
table one recently added. The overhead of that should be very low.
Are the three places where "need_eoxact_work = true;" the only places
where things need to be added to the new structure?
Yeah. The problem is not so much the number of places that do that,
as that places that flush entries from the relcache would need to know
to remove them from the separate list, else you'd have dangling
pointers. It's certainly not impossible, I was just unsure how much
of a pain in the rear it might be.
The next quadratic behavior is in init_sequence.
Yeah, that's another place that is using a linear list that perhaps
should be a hashtable. OTOH, probably most sessions don't touch enough
different sequences for that to be a win.
regards, tom lane
Tom Lane escribió:
Jeff Janes <jeff.janes@gmail.com> writes:
The next quadratic behavior is in init_sequence.
Yeah, that's another place that is using a linear list that perhaps
should be a hashtable. OTOH, probably most sessions don't touch enough
different sequences for that to be a win.
Could we use some adaptive mechanism here? Say we use a list for the
first ten entries, and if an eleventh one comes in, we create a hash
table for that one and all subsequent ones. All future calls would
have to examine both the list for the first few and then the hash table.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Patch applied to git head. Thanks Ants Aasma for the analysis that led
to the patch.
---------------------------------------------------------------------------
On Tue, Nov 13, 2012 at 07:03:51PM -0500, Bruce Momjian wrote:
On Tue, Nov 13, 2012 at 05:44:54AM +0200, Ants Aasma wrote:
On Mon, Nov 12, 2012 at 10:59 PM, Bruce Momjian <bruce@momjian.us> wrote:
You can see a significant speedup with those loops removed. The 16k
case is improved, but still not linear. The 16k dump/restore scale
looks fine, so it must be something in pg_upgrade, or in the kernel.

I can confirm the speedup. Profiling results for 9.3 to 9.3 upgrade
for 8k and 64k tables are attached. pg_upgrade itself is now taking
negligible time.

I generated these timings from the attached test script.
-------------------------- 9.3 ------------------------
---- normal ---- -- binary_upgrade -- -- pg_upgrade -
- dmp - - res - - dmp - - res - git patch
1 0.12 0.07 0.13 0.07 11.06 11.02
1000 2.20 2.46 3.57 2.82 19.15 18.61
2000 4.51 5.01 8.22 5.80 29.12 26.89
4000 8.97 10.88 14.76 12.43 45.87 43.08
8000 15.30 24.72 30.57 27.10 100.31 79.75
16000 36.14 54.88 62.27 61.69 248.03 167.94
32000 55.29 162.20 115.16 179.15 695.05 376.84
64000 149.86 716.46 265.77 724.32 2323.73 1122.38

You can see the speedup of the patch, particularly for a greater number
of tables, e.g. 2x faster for 64k tables.

The 64k profile shows the AtEOXact_RelationCache scaling problem. For
the 8k profile nothing really pops out as a clear bottleneck. CPU time
distributes 83.1% to postgres, 4.9% to pg_dump, 7.4% to psql and 0.7%
to pg_upgrade.

At 64k I see pg_upgrade taking 12% of the duration time, if I subtract
out the dump/restore times.

I am attaching an updated pg_upgrade patch, which I believe is ready for
application for 9.3.

Postgres time itself breaks down with 10% for shutdown checkpoint and
90% for regular running, consisting of 16% parsing, 13% analyze, 20%
plan, 30% execute, 11% commit (AtEOXact_RelationCache) and 6% network.

That SVG graph was quite impressive.

It looks to me that most benefit could be had from introducing more
parallelism. Are there any large roadblocks to pipelining the dump and
restore to have them happen in parallel?

I talked to Andrew Dunstan about parallelization in pg_restore. First,
we currently use pg_dumpall, which isn't in the custom format required
for parallel restore, and even if we changed to custom format, create
table isn't done in parallel, only create index/check constraints, and
trigger creation, etc. Not sure if it is worth pursuing this just for
pg_upgrade.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Could we use some adaptive mechanism here? Say we use a list for the
first ten entries, and if an eleventh one comes in, we create a hash
table for that one and all subsequent ones. All future calls would
have to examine both the list for the first few and then the hash table.
Is it necessary to do so? Do we know for sure that a 10-element hash
table is slower than a 10-element list when only doing key-based
lookups, for the object data type we're interested in here?
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Dimitri Fontaine <dimitri@2ndQuadrant.fr> writes:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Could we use some adaptive mechanism here? Say we use a list for the
first ten entries, and if an eleventh one comes in, we create a hash
table for that one and all subsequent ones. All future calls would
have to examine both the list for the first few and then the hash table.
Is it necessary to do so? Do we know for sure that a 10-element hash
table is slower than a 10-element list when only doing key-based
lookups, for the object data type we're interested in here?
Well, we'd want to do some testing to choose the cutover point.
Personally I'd bet on that point being quite a bit higher than ten,
for the case that sequence.c is using where the key being compared is
just an OID. You can compare a lot of OIDs in the time it takes
dynahash.c to do something.
(I think the above sketch is wrong in detail, btw. What we should do
once we decide to create a hash table is move all the existing entries
into the hash table, not continue to scan a list for them. There's a
similar case in the planner for tracking join RelOptInfos.)
regards, tom lane
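A toy standalone-C version of that adaptive scheme, including the migrate-on-cutover detail, might look like the sketch below. The names (seen_sequence, LIST_CUTOVER) and the fixed-size open-addressing table are inventions for illustration; the backend would presumably use dynahash instead, and a real table would grow rather than be fixed at 1024 slots:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned Oid;

#define LIST_CUTOVER 8        /* switch to the hash past this many entries */
#define HASH_SLOTS   1024     /* toy fixed-size open-addressing table */

static Oid  seq_list[LIST_CUTOVER];
static int  n_seq_list = 0;
static Oid  seq_hash[HASH_SLOTS];   /* 0 marks an empty slot */
static bool use_hash = false;

static void
hash_insert(Oid oid)
{
    unsigned slot = oid % HASH_SLOTS;

    while (seq_hash[slot] != 0)
        slot = (slot + 1) % HASH_SLOTS;   /* linear probing */
    seq_hash[slot] = oid;
}

static bool
hash_lookup(Oid oid)
{
    unsigned slot = oid % HASH_SLOTS;

    while (seq_hash[slot] != 0)
    {
        if (seq_hash[slot] == oid)
            return true;
        slot = (slot + 1) % HASH_SLOTS;
    }
    return false;
}

/* Have we touched this sequence before?  Remember it either way. */
static bool
seen_sequence(Oid oid)
{
    if (!use_hash)
    {
        for (int i = 0; i < n_seq_list; i++)
            if (seq_list[i] == oid)
                return true;
        if (n_seq_list < LIST_CUTOVER)
        {
            seq_list[n_seq_list++] = oid;
            return false;
        }
        /* Cutover: migrate the existing entries into the hash so later
         * lookups never have to scan the list as well. */
        use_hash = true;
        for (int i = 0; i < n_seq_list; i++)
            hash_insert(seq_list[i]);
    }
    if (hash_lookup(oid))
        return true;
    hash_insert(oid);
    return false;
}
```

Before cutover every lookup is a short linear scan; after cutover all entries, old and new, live in the hash, so there is never a need to examine both structures.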
On Wed, Nov 14, 2012 at 11:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
There are at least three ways we could whack that mole: ...
* Keep a separate list (or data structure of your choice) so that
relcache entries created in the current xact could be found directly
rather than having to scan the whole relcache. That'd add complexity
though, and could perhaps be a net loss for cases where the relcache
isn't so bloated.

Maybe a static list that can overflow, like the ResourceOwner/Lock
table one recently added. The overhead of that should be very low.

Are the three places where "need_eoxact_work = true;" the only places
where things need to be added to the new structure?

Yeah. The problem is not so much the number of places that do that,
as that places that flush entries from the relcache would need to know
to remove them from the separate list, else you'd have dangling
pointers.
If the list is of hash-tags rather than pointers, all we would have to
do is ignore entries that are not still in the hash table, right?
On a related thought, it is a shame that "create temp table on commit
drop" sets "need_eoxact_work", because by the time we get to
AtEOXact_RelationCache upon commit, the entry is already gone and so
there is no actual work to do (unless a non-temp table was also
created). But on abort, the entry is still there. I don't know if
there is an opportunity for optimization there for people who use temp
tables a lot. If we go with a caching list, that would render it moot
unless they use so many as to routinely overflow the cache.
Cheers,
Jeff
On Thu, Nov 15, 2012 at 07:05:00PM -0800, Jeff Janes wrote:
On Wed, Nov 14, 2012 at 11:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
There are at least three ways we could whack that mole: ...
* Keep a separate list (or data structure of your choice) so that
relcache entries created in the current xact could be found directly
rather than having to scan the whole relcache. That'd add complexity
though, and could perhaps be a net loss for cases where the relcache
isn't so bloated.

Maybe a static list that can overflow, like the ResourceOwner/Lock
table one recently added. The overhead of that should be very low.

Are the three places where "need_eoxact_work = true;" the only places
where things need to be added to the new structure?

Yeah. The problem is not so much the number of places that do that,
as that places that flush entries from the relcache would need to know
to remove them from the separate list, else you'd have dangling
pointers.

If the list is of hash-tags rather than pointers, all we would have to
do is ignore entries that are not still in the hash table, right?

On a related thought, it is a shame that "create temp table on commit
drop" sets "need_eoxact_work", because by the time we get to
AtEOXact_RelationCache upon commit, the entry is already gone and so
there is no actual work to do (unless a non-temp table was also
created). But on abort, the entry is still there. I don't know if
there is an opportunity for optimization there for people who use temp
tables a lot. If we go with a caching list, that would render it moot
unless they use so many as to routinely overflow the cache.
I added the attached C comment last year to mention why temp tables are
not as isolated as we think, and can't be optimized as much as you would
think.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Attachments:
temp.difftext/x-diff; charset=us-asciiDownload
commit f458c90bff45ecae91fb55ef2b938af37d977af3
Author: Bruce Momjian <bruce@momjian.us>
Date: Mon Sep 5 22:08:14 2011 -0400
Add C comment about why we send cache invalidation messages for
session-local objects.
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
new file mode 100644
index 337fe64..98dc3ad
*** a/src/backend/utils/cache/inval.c
--- b/src/backend/utils/cache/inval.c
*************** ProcessCommittedInvalidationMessages(Sha
*** 812,817 ****
--- 812,821 ----
* about CurrentCmdInvalidMsgs too, since those changes haven't touched
* the caches yet.
*
+ * We still send invalidation messages for session-local objects to other
+ * backends because, while other backends cannot see any tuples, they can
+ * drop tables that are session-local to another session.
+ *
* In any case, reset the various lists to empty. We need not physically
* free memory here, since TopTransactionContext is about to be emptied
* anyway.
On Thu, Nov 15, 2012 at 7:05 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Nov 14, 2012 at 11:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
There are at least three ways we could whack that mole: ...
* Keep a separate list (or data structure of your choice) so that
relcache entries created in the current xact could be found directly
rather than having to scan the whole relcache. That'd add complexity
though, and could perhaps be a net loss for cases where the relcache
isn't so bloated.

Maybe a static list that can overflow, like the ResourceOwner/Lock
table one recently added. The overhead of that should be very low.

Are the three places where "need_eoxact_work = true;" the only places
where things need to be added to the new structure?

Yeah. The problem is not so much the number of places that do that,
as that places that flush entries from the relcache would need to know
to remove them from the separate list, else you'd have dangling
pointers.

If the list is of hash-tags rather than pointers, all we would have to
do is ignore entries that are not still in the hash table, right?
I've attached a proof-of-concept patch to implement this.
I got rid of need_eoxact_work entirely and replaced it with a short
list that fulfills the functions of indicating that work is needed,
and suggesting which rels might need that work. There is no attempt
to prevent duplicates, nor to remove invalidated entries from the
list. Invalid entries are skipped when the hash entry is not found,
and processing is idempotent so duplicates are not a problem.
Formally speaking, if MAX_EOXACT_LIST were 0, so that the list
overflowed the first time it was accessed, then it would be identical
to the current behavior of having only a flag. So formally all I did
was increase the max from 0 to 10.
I wasn't so sure about the idempotent nature of subtransaction
processing, so I chickened out and left that part alone. I know of no
workflow for which that was a bottleneck.
AtEOXact_release is oddly indented because that makes the patch
smaller and easier to read.
This makes the non "-1" restore of large dumps very much faster (and
makes them faster than "-1" restores, as well)
I added a "create temp table foo (x integer) on commit drop;" line to
the default pgbench transaction and tested that. I was hoping to see
a performance improvement there as well (the transaction has ~110
entries in the RelationIdCache at eoxact each time), but the
performance was too variable (probably due to the intense IO it
causes) to detect any changes. At least it is not noticeably slower.
If I hack pgbench to bloat the RelationIdCache by touching 20,000
useless tables as part of the connection start up process, then this
patch does show a win.
It is not obvious what value to set the MAX list size to. Since this
array is only allocated once per back-end, and since it is not groveled
through to invalidate relations at each invalidation, there is no
particular reason it must be small. But if the same table is assigned
new filenodes (or forced index lists, whatever those are) repeatedly
within a transaction, the list could become bloated with duplicate
entries, potentially becoming even larger than the hash table whose
scan it is intended to short-cut.
In any event, 10 seems to be large enough to overcome the currently
known bottle-neck. Maybe 100 would be a more principled number, as
that is about where the list could start to become as big as the basal
size of the RelationIdCache table.
I don't think this patch replaces having some mechanism for
restricting how large RelationIdCache can get or how LRU entries in it
can get as Robert suggested. But this approach seems like it is
easier to implement and agree upon; and doesn't preclude doing other
optimizations later.
Cheers,
Jeff
Attachments:
relcache_list_v1.patchapplication/octet-stream; name=relcache_list_v1.patchDownload
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
new file mode 100644
index 8c9ebe0..e49464c
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
*************** static long relcacheInvalsReceived = 0L;
*** 136,145 ****
*/
static List *initFileRelationIds = NIL;
! /*
! * This flag lets us optimize away work in AtEO(Sub)Xact_RelationCache().
*/
! static bool need_eoxact_work = false;
/*
--- 136,160 ----
*/
static List *initFileRelationIds = NIL;
! /*
! * To avoid the need to inspect the entire hash table in
! * AtEO(Sub)Xact_RelationCache(), keep a small list of
! * relations that may need work. If it overflows, then
! * revert to inspecting the entire hash table. If
! * n_eoxact_list == -1, then no work can be necessary.
! * If the same entry shows up in the list more than once,
! * no harm is done as the processing is idempotent.
*/
! #define MAX_EOXACT_LIST 10
! static Oid eoxact_list[MAX_EOXACT_LIST];
! static int n_eoxact_list=-1;
!
! #define EOXactListAdd(rel) \
! do { \
! if (n_eoxact_list < MAX_EOXACT_LIST) \
! if (++n_eoxact_list < MAX_EOXACT_LIST) \
! eoxact_list[n_eoxact_list]=rel->rd_id; \
! } while (0)
/*
*************** static void RelationClearRelation(Relati
*** 204,209 ****
--- 219,225 ----
static void RelationReloadIndexInfo(Relation relation);
static void RelationFlushRelation(Relation relation);
+ static void AtEOXact_release(bool isCommit, RelIdCacheEnt *idhentry);
static bool load_relcache_init_file(bool shared);
static void write_relcache_init_file(bool shared);
static void write_item(const void *data, Size len, FILE *fp);
*************** AtEOXact_RelationCache(bool isCommit)
*** 2246,2269 ****
* To speed up transaction exit, we want to avoid scanning the relcache
* unless there is actually something for this routine to do. Other than
* the debug-only Assert checks, most transactions don't create any work
! * for us to do here, so we keep a static flag that gets set if there is
* anything to do. (Currently, this means either a relation is created in
* the current xact, or one is given a new relfilenode, or an index list
! * is forced.) For simplicity, the flag remains set till end of top-level
* transaction, even though we could clear it at subtransaction end in
! * some cases.
*/
! if (!need_eoxact_work
#ifdef USE_ASSERT_CHECKING
&& !assert_enabled
#endif
)
- return;
-
- hash_seq_init(&status, RelationIdCache);
-
- while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
{
Relation relation = idhentry->reldesc;
/*
--- 2262,2303 ----
* To speed up transaction exit, we want to avoid scanning the relcache
* unless there is actually something for this routine to do. Other than
* the debug-only Assert checks, most transactions don't create any work
! * for us to do here, so we keep a small static list that gets pushed onto if there is
* anything to do. (Currently, this means either a relation is created in
* the current xact, or one is given a new relfilenode, or an index list
! * is forced.) For simplicity, the list remains set till end of top-level
* transaction, even though we could clear it at subtransaction end in
! * some cases, or remove relations from it if they are cleared for other reasons.
*/
!
! if (n_eoxact_list < MAX_EOXACT_LIST
#ifdef USE_ASSERT_CHECKING
&& !assert_enabled
#endif
)
{
+ for ( ; n_eoxact_list>=0; n_eoxact_list--) {
+ idhentry = (RelIdCacheEnt*)hash_search(RelationIdCache,
+ (void *) &(eoxact_list[n_eoxact_list]),
+ HASH_FIND, NULL);
+ if (idhentry)
+ AtEOXact_release(isCommit, idhentry);
+ }
+ }
+ else
+ {
+ hash_seq_init(&status, RelationIdCache);
+ while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
+ {
+ AtEOXact_release(isCommit, idhentry);
+ }
+ }
+ n_eoxact_list = -1;
+ }
+
+ void
+ AtEOXact_release(bool isCommit, RelIdCacheEnt *idhentry)
+ {
Relation relation = idhentry->reldesc;
/*
*************** AtEOXact_RelationCache(bool isCommit)
*** 2301,2307 ****
else
{
RelationClearRelation(relation, false);
! continue;
}
}
--- 2335,2341 ----
else
{
RelationClearRelation(relation, false);
! return;
}
}
*************** AtEOXact_RelationCache(bool isCommit)
*** 2320,2329 ****
relation->rd_oidindex = InvalidOid;
relation->rd_indexvalid = 0;
}
- }
-
- /* Once done with the transaction, we can reset need_eoxact_work */
- need_eoxact_work = false;
}
/*
--- 2354,2359 ----
*************** AtEOSubXact_RelationCache(bool isCommit,
*** 2344,2350 ****
* Skip the relcache scan if nothing to do --- see notes for
* AtEOXact_RelationCache.
*/
! if (!need_eoxact_work)
return;
hash_seq_init(&status, RelationIdCache);
--- 2374,2380 ----
* Skip the relcache scan if nothing to do --- see notes for
* AtEOXact_RelationCache.
*/
! if (n_eoxact_list == -1)
return;
hash_seq_init(&status, RelationIdCache);
*************** RelationBuildLocalRelation(const char *r
*** 2482,2489 ****
rel->rd_createSubid = GetCurrentSubTransactionId();
rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
- /* must flag that we have rels created in this transaction */
- need_eoxact_work = true;
/*
* create a new tuple descriptor from the one passed in. We do this
--- 2512,2517 ----
*************** RelationBuildLocalRelation(const char *r
*** 2568,2575 ****
RelationInitPhysicalAddr(rel);
/*
! * Okay to insert into the relcache hash tables.
*/
RelationCacheInsert(rel);
/*
--- 2596,2604 ----
RelationInitPhysicalAddr(rel);
/*
! * Okay to insert into the relcache hash tables. Must flag that we have rels created in this transaction
*/
+ EOXactListAdd(rel);
RelationCacheInsert(rel);
/*
*************** RelationSetNewRelfilenode(Relation relat
*** 2696,2702 ****
*/
relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
/* ... and now we have eoxact cleanup work to do */
! need_eoxact_work = true;
}
--- 2725,2731 ----
*/
relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
/* ... and now we have eoxact cleanup work to do */
! EOXactListAdd(relation);
}
*************** RelationSetIndexList(Relation relation,
*** 3469,3475 ****
relation->rd_oidindex = oidIndex;
relation->rd_indexvalid = 2; /* mark list as forced */
/* must flag that we have a forced index list */
! need_eoxact_work = true;
}
/*
--- 3498,3504 ----
relation->rd_oidindex = oidIndex;
relation->rd_indexvalid = 2; /* mark list as forced */
/* must flag that we have a forced index list */
! EOXactListAdd(relation);
}
/*
Added to TODO:
Improve cache lookup speed for sessions accessing many relations
http://archives.postgresql.org/pgsql-hackers/2012-11/msg00356.php
---------------------------------------------------------------------------
On Fri, Nov 9, 2012 at 12:50:34AM -0500, Tom Lane wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
Are you sure the server you are dumping out of is head?
I experimented a bit with dumping/restoring 16000 tables matching
Bruce's test case (ie, one serial column apiece). The pg_dump profile
seems fairly flat, without any easy optimization targets. But
restoring the dump script shows a rather interesting backend profile:

samples % image name symbol name
30861 39.6289 postgres AtEOXact_RelationCache
9911 12.7268 postgres hash_seq_search
2682 3.4440 postgres init_sequence
2218 2.8482 postgres _bt_compare
2120 2.7223 postgres hash_search_with_hash_value
1976 2.5374 postgres XLogInsert
1429 1.8350 postgres CatalogCacheIdInvalidate
1282 1.6462 postgres LWLockAcquire
973 1.2494 postgres LWLockRelease
702 0.9014 postgres hash_any

The hash_seq_search time is probably mostly associated with
AtEOXact_RelationCache, which is run during transaction commit and scans
the relcache hashtable looking for tables created in the current
transaction. So that's about 50% of the runtime going into that one
activity.

There are at least three ways we could whack that mole:

* Run the psql script in --single-transaction mode, as I was mumbling
about the other day. If we were doing AtEOXact_RelationCache only once,
rather than once per CREATE TABLE statement, it wouldn't be a problem.
Easy but has only a narrow scope of applicability.

* Keep a separate list (or data structure of your choice) so that
relcache entries created in the current xact could be found directly
rather than having to scan the whole relcache. That'd add complexity
though, and could perhaps be a net loss for cases where the relcache
isn't so bloated.

* Limit the size of the relcache (eg by aging out
not-recently-referenced entries) so that we aren't incurring O(N^2)
costs for scripts touching N tables. Again, this adds complexity and
could be counterproductive in some scenarios.

regards, tom lane
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Mon, Nov 12, 2012 at 06:14:59PM -0300, Alvaro Herrera wrote:
Bruce Momjian escribió:

--- 17,24 ----

static void transfer_single_new_db(pageCnvCtx *pageConverter,
FileNameMap *maps, int size);
! static int transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
! const char *suffix);

Uh, does this code assume that forks other than the main one are not
split in segments? I think that's a bug, is it not?
Actually, the segment scanning now happens inside transfer_relfile().
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Fri, Nov 23, 2012 at 5:34 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Thu, Nov 15, 2012 at 7:05 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Nov 14, 2012 at 11:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
There are at least three ways we could whack that mole: ...
* Keep a separate list (or data structure of your choice) so that
relcache entries created in the current xact could be found directly
rather than having to scan the whole relcache. That'd add complexity
though, and could perhaps be a net loss for cases where the relcache
isn't so bloated.

Maybe a static list that can overflow, like the ResourceOwner/Lock
table one recently added. The overhead of that should be very low.

Are the three places where "need_eoxact_work = true;" the only places
where things need to be added to the new structure?

Yeah. The problem is not so much the number of places that do that,
as that places that flush entries from the relcache would need to know
to remove them from the separate list, else you'd have dangling
pointers.

If the list is of hash-tags rather than pointers, all we would have to
do is ignore entries that are not still in the hash table, right?

I've attached a proof-of-concept patch to implement this.

I got rid of need_eoxact_work entirely and replaced it with a short
list that fulfills the functions of indicating that work is needed,
and suggesting which rels might need that work. There is no attempt
to prevent duplicates, nor to remove invalidated entries from the
list. Invalid entries are skipped when the hash entry is not found,
and processing is idempotent so duplicates are not a problem.

Formally speaking, if MAX_EOXACT_LIST were 0, so that the list
overflowed the first time it was accessed, then it would be identical
to the current behavior of having only a flag. So formally all I did
was increase the max from 0 to 10.

I wasn't so sure about the idempotent nature of subtransaction
processing, so I chickened out and left that part alone. I know of no
workflow for which that was a bottleneck.

AtEOXact_release is oddly indented because that makes the patch
smaller and easier to read.

This makes the non "-1" restore of large dumps very much faster (and
makes them faster than "-1" restores, as well)

I added a "create temp table foo (x integer) on commit drop;" line to
the default pgbench transaction and tested that. I was hoping to see
a performance improvement there as well (the transaction has ~110
entries in the RelationIdCache at eoxact each time), but the
performance was too variable (probably due to the intense IO it
causes) to detect any changes. At least it is not noticeably slower.
If I hack pgbench to bloat the RelationIdCache by touching 20,000
useless tables as part of the connection start up process, then this
patch does show a win.

It is not obvious what value to set the MAX list size to. Since this
array is only allocated once per back-end, and since it is not groveled
through to invalidate relations at each invalidation, there is no
particular reason it must be small. But if the same table is assigned
new filenodes (or forced index lists, whatever those are) repeatedly
within a transaction, the list could become bloated with duplicate
entries, potentially becoming even larger than the hash table whose
scan it is intended to short-cut.

In any event, 10 seems to be large enough to overcome the currently
known bottle-neck. Maybe 100 would be a more principled number, as
that is about where the list could start to become as big as the basal
size of the RelationIdCache table.

I don't think this patch replaces having some mechanism for
restricting how large RelationIdCache can get or how LRU entries in it
can get as Robert suggested. But this approach seems like it is
easier to implement and agree upon; and doesn't preclude doing other
optimizations later.
I haven't reviewed this terribly closely, but I think this is likely
worth pursuing. I see you already added it to the next CommitFest,
which is good.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Nov 14, 2012 at 10:08:15AM -0500, Bruce Momjian wrote:
I agree that parallel restore for schemas is a hard problem. But I
didn't mean parallelism within the restore, I meant that we could
start both postmasters and pipe the output from dump directly to
restore. This way the times for dumping and restoring can overlap.Wow, that is a very creative idea. The current code doesn't do that,
but this has the potential of doubling pg_upgrade's speed, without
adding a lot of complexity. Here are the challenges of this approach:* I would need to log the output of pg_dumpall as it is passed to psql
so users can debug problems* pg_upgrade never runs the old and new clusters at the same time for
fear that it will run out of resources, e.g. shared memory, or if they
are using the same port number. We can make this optional and force
different port numbers.Let me work up a prototype in the next few days and see how it performs.
Thanks for the great idea.
I have developed the attached proof-of-concept patch to test this idea.
Unfortunately, I got poor results:
---- pg_upgrade ----
dump restore dmp|res git dmp/res
1 0.12 0.07 0.13 11.16 13.03
1000 3.80 2.83 5.46 18.78 20.27
2000 5.39 5.65 13.99 26.78 28.54
4000 16.08 12.40 28.34 41.90 44.03
8000 32.77 25.70 57.97 78.61 80.09
16000 57.67 63.42 134.43 158.49 165.78
32000 131.84 176.27 302.85 380.11 389.48
64000 270.37 708.30 1004.39 1085.39 1094.70
The last two columns show the patch didn't help at all, and the third
column shows it is just executing the pg_dump, then the restore, not in
parallel, i.e. column 1 + column 2 ~= column 3.
Testing pg_dump for 4k tables (16 seconds) shows the first row is not
output by pg_dump until 15 seconds, meaning there can't be any
parallelism with a pipe. (Test script attached.) Does anyone know how
to get pg_dump to send some output earlier? In summary, it doesn't seem
pg_dump makes any attempt to output its data early. pg_dump.c has some
details:
/*
* And finally we can do the actual output.
*
* Note: for non-plain-text output formats, the output file is written
* inside CloseArchive(). This is, um, bizarre; but not worth changing
* right now.
*/
if (plainText)
RestoreArchive(fout);
CloseArchive(fout);
FYI, log_min_duration_statement shows queries taking 11.2 seconds, even
without the network overhead --- not sure how that can be optimized.
I will now test using PRIMARY KEY and custom dump format with pg_restore
--jobs to see if I can get parallelism that way.
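For reference, that test would look something like the following; note that parallel restore requires a seekable archive, so pg_restore --jobs must read a custom-format file rather than a pipe (the file name and job count are illustrative):

```shell
# Hypothetical sketch of the -Fc + pg_restore --jobs test.  Parallel
# restore cannot read from stdin, so the dump goes to a file first.
pg_dump --schema-only --binary-upgrade -Fc -f schema.dump "$DB"
pg_restore --jobs=4 --dbname="$DB" schema.dump
```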
A further improvement would be to allow multiple databases to be
dumped/restored at the same time. I will test for that once this is done.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Attachments:
pipe.diff (text/x-diff; charset=us-ascii)
diff --git a/contrib/pg_upgrade/dump.c b/contrib/pg_upgrade/dump.c
new file mode 100644
index 577ccac..dbdf9a5
*** a/contrib/pg_upgrade/dump.c
--- b/contrib/pg_upgrade/dump.c
*************** generate_old_dump(void)
*** 24,30 ****
* restores the frozenid's for databases and relations.
*/
exec_prog(UTILITY_LOG_FILE, NULL, true,
! "\"%s/pg_dumpall\" %s --schema-only --binary-upgrade %s -f %s",
new_cluster.bindir, cluster_conn_opts(&old_cluster),
log_opts.verbose ? "--verbose" : "",
ALL_DUMP_FILE);
--- 24,30 ----
* restores the frozenid's for databases and relations.
*/
exec_prog(UTILITY_LOG_FILE, NULL, true,
! "\"%s/pg_dumpall\" %s --schema-only --globals-only --binary-upgrade %s -f %s",
new_cluster.bindir, cluster_conn_opts(&old_cluster),
log_opts.verbose ? "--verbose" : "",
ALL_DUMP_FILE);
*************** generate_old_dump(void)
*** 47,63 ****
void
split_old_dump(void)
{
! FILE *all_dump,
! *globals_dump,
! *db_dump;
! FILE *current_output;
char line[LINE_ALLOC];
bool start_of_line = true;
char create_role_str[MAX_STRING];
char create_role_str_quote[MAX_STRING];
char filename[MAXPGPATH];
- bool suppressed_username = false;
-
/*
* Open all files in binary mode to avoid line end translation on Windows,
--- 47,58 ----
void
split_old_dump(void)
{
! FILE *all_dump, *globals_dump;
char line[LINE_ALLOC];
bool start_of_line = true;
char create_role_str[MAX_STRING];
char create_role_str_quote[MAX_STRING];
char filename[MAXPGPATH];
/*
* Open all files in binary mode to avoid line end translation on Windows,
*************** split_old_dump(void)
*** 70,80 ****
snprintf(filename, sizeof(filename), "%s", GLOBALS_DUMP_FILE);
if ((globals_dump = fopen_priv(filename, PG_BINARY_W)) == NULL)
pg_log(PG_FATAL, "Could not write to dump file \"%s\": %s\n", filename, getErrorText(errno));
- snprintf(filename, sizeof(filename), "%s", DB_DUMP_FILE);
- if ((db_dump = fopen_priv(filename, PG_BINARY_W)) == NULL)
- pg_log(PG_FATAL, "Could not write to dump file \"%s\": %s\n", filename, getErrorText(errno));
-
- current_output = globals_dump;
/* patterns used to prevent our own username from being recreated */
snprintf(create_role_str, sizeof(create_role_str),
--- 65,70 ----
*************** split_old_dump(void)
*** 84,102 ****
while (fgets(line, sizeof(line), all_dump) != NULL)
{
- /* switch to db_dump file output? */
- if (current_output == globals_dump && start_of_line &&
- suppressed_username &&
- strncmp(line, "\\connect ", strlen("\\connect ")) == 0)
- current_output = db_dump;
-
/* output unless we are recreating our own username */
! if (current_output != globals_dump || !start_of_line ||
(strncmp(line, create_role_str, strlen(create_role_str)) != 0 &&
strncmp(line, create_role_str_quote, strlen(create_role_str_quote)) != 0))
! fputs(line, current_output);
! else
! suppressed_username = true;
if (strlen(line) > 0 && line[strlen(line) - 1] == '\n')
start_of_line = true;
--- 74,84 ----
while (fgets(line, sizeof(line), all_dump) != NULL)
{
/* output unless we are recreating our own username */
! if (!start_of_line ||
(strncmp(line, create_role_str, strlen(create_role_str)) != 0 &&
strncmp(line, create_role_str_quote, strlen(create_role_str_quote)) != 0))
! fputs(line, globals_dump);
if (strlen(line) > 0 && line[strlen(line) - 1] == '\n')
start_of_line = true;
*************** split_old_dump(void)
*** 106,110 ****
fclose(all_dump);
fclose(globals_dump);
- fclose(db_dump);
}
--- 88,91 ----
diff --git a/contrib/pg_upgrade/function.c b/contrib/pg_upgrade/function.c
new file mode 100644
index 77bd3a0..d95dd6f
*** a/contrib/pg_upgrade/function.c
--- b/contrib/pg_upgrade/function.c
*************** uninstall_support_functions_from_new_clu
*** 102,111 ****
prep_status("Removing support functions from new cluster");
! for (dbnum = 0; dbnum < new_cluster.dbarr.ndbs; dbnum++)
{
! DbInfo *new_db = &new_cluster.dbarr.dbs[dbnum];
! PGconn *conn = connectToServer(&new_cluster, new_db->db_name);
/* suppress NOTICE of dropped objects */
PQclear(executeQueryOrDie(conn,
--- 102,112 ----
prep_status("Removing support functions from new cluster");
! /* use old db names because there might be a mismatch */
! for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
{
! DbInfo *old_db = &old_cluster.dbarr.dbs[dbnum];
! PGconn *conn = connectToServer(&new_cluster, old_db->db_name);
/* suppress NOTICE of dropped objects */
PQclear(executeQueryOrDie(conn,
diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
new file mode 100644
index 4d2e79c..e218138
*** a/contrib/pg_upgrade/pg_upgrade.c
--- b/contrib/pg_upgrade/pg_upgrade.c
*************** main(int argc, char **argv)
*** 117,129 ****
/* New now using xids of the old system */
/* -- NEW -- */
start_postmaster(&new_cluster);
prepare_new_databases();
!
create_new_objects();
stop_postmaster(false);
/*
* Most failures happen in create_new_objects(), which has completed at
--- 117,134 ----
/* New now using xids of the old system */
/* -- NEW -- */
+ old_cluster.port++;
+ start_postmaster(&old_cluster);
start_postmaster(&new_cluster);
prepare_new_databases();
!
create_new_objects();
stop_postmaster(false);
+ os_info.running_cluster = &old_cluster;
+ stop_postmaster(false);
+ old_cluster.port--;
/*
* Most failures happen in create_new_objects(), which has completed at
*************** static void
*** 279,308 ****
create_new_objects(void)
{
int dbnum;
prep_status("Adding support functions to new cluster");
! for (dbnum = 0; dbnum < new_cluster.dbarr.ndbs; dbnum++)
{
! DbInfo *new_db = &new_cluster.dbarr.dbs[dbnum];
/* skip db we already installed */
! if (strcmp(new_db->db_name, "template1") != 0)
! install_support_functions_in_new_db(new_db->db_name);
}
check_ok();
prep_status("Restoring database schema to new cluster");
! exec_prog(RESTORE_LOG_FILE, NULL, true,
! "\"%s/psql\" " EXEC_PSQL_ARGS " %s -f \"%s\"",
! new_cluster.bindir, cluster_conn_opts(&new_cluster),
! DB_DUMP_FILE);
check_ok();
/* regenerate now that we have objects in the databases */
get_db_and_rel_infos(&new_cluster);
uninstall_support_functions_from_new_cluster();
}
/*
--- 284,330 ----
create_new_objects(void)
{
int dbnum;
+ /* save off conn_opts because it is a static local var */
+ char *old_conn_opts = pg_strdup(cluster_conn_opts(&old_cluster));
prep_status("Adding support functions to new cluster");
! /*
! * The new cluster might have databases that don't exist in the old
! * one, so cycle over the old database names.
! */
! for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
{
! DbInfo *old_db = &old_cluster.dbarr.dbs[dbnum];
/* skip db we already installed */
! if (strcmp(old_db->db_name, "template1") != 0)
! install_support_functions_in_new_db(old_db->db_name);
}
check_ok();
prep_status("Restoring database schema to new cluster");
!
! for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
! {
! DbInfo *old_db = &old_cluster.dbarr.dbs[dbnum];
!
! exec_prog(RESTORE_LOG_FILE, NULL, true,
! "\"%s/pg_dump\" %s --schema-only --binary-upgrade %s \"%s\" | "
! "\"%s/psql\" " EXEC_PSQL_ARGS " %s -d \"%s\"",
! new_cluster.bindir, old_conn_opts,
! log_opts.verbose ? "--verbose" : "", old_db->db_name,
! new_cluster.bindir, cluster_conn_opts(&new_cluster),
! old_db->db_name);
! }
check_ok();
/* regenerate now that we have objects in the databases */
get_db_and_rel_infos(&new_cluster);
uninstall_support_functions_from_new_cluster();
+
+ pg_free(old_conn_opts);
}
/*
*************** cleanup(void)
*** 463,468 ****
/* remove SQL files */
unlink(ALL_DUMP_FILE);
unlink(GLOBALS_DUMP_FILE);
- unlink(DB_DUMP_FILE);
}
}
--- 485,489 ----
diff --git a/contrib/pg_upgrade/pg_upgrade.h b/contrib/pg_upgrade/pg_upgrade.h
new file mode 100644
index ace56e5..72ed3bd
*** a/contrib/pg_upgrade/pg_upgrade.h
--- b/contrib/pg_upgrade/pg_upgrade.h
***************
*** 32,38 ****
#define ALL_DUMP_FILE "pg_upgrade_dump_all.sql"
/* contains both global db information and CREATE DATABASE commands */
#define GLOBALS_DUMP_FILE "pg_upgrade_dump_globals.sql"
- #define DB_DUMP_FILE "pg_upgrade_dump_db.sql"
#define SERVER_LOG_FILE "pg_upgrade_server.log"
#define RESTORE_LOG_FILE "pg_upgrade_restore.log"
--- 32,37 ----
diff --git a/src/bin/pg_dump/pg_dumpall.c b/src/bin/pg_dump/pg_dumpall.c
new file mode 100644
index ca95bad..ce27fe3
*** a/src/bin/pg_dump/pg_dumpall.c
--- b/src/bin/pg_dump/pg_dumpall.c
*************** main(int argc, char *argv[])
*** 502,508 ****
}
/* Dump CREATE DATABASE commands */
! if (!globals_only && !roles_only && !tablespaces_only)
dumpCreateDB(conn);
/* Dump role/database settings */
--- 502,508 ----
}
/* Dump CREATE DATABASE commands */
! if (binary_upgrade || (!globals_only && !roles_only && !tablespaces_only))
dumpCreateDB(conn);
/* Dump role/database settings */
Bruce Momjian <bruce@momjian.us> writes:
Testing pg_dump for 4k tables (16 seconds) shows the first row is not
output by pg_dump until 15 seconds, meaning there can't be any
parallelism with a pipe. (Test script attached.) Does anyone know how
to get pg_dump to send some output earlier?
You can't. By the time it knows what order to emit the objects in,
it's done all the preliminary work you're griping about.
(In a dump with data, there would be a meaningful amount of computation
remaining, but not in a schema-only dump.)
I will now test using PRIMARY KEY and custom dump format with pg_restore
--jobs to see if I can get parallelism that way.
This seems likely to be a waste of effort for the same reason: you only
get meaningful parallelism when there's a substantial data component to
be restored.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Nov 26, 2012 at 05:26:42PM -0500, Bruce Momjian wrote:
I have developed the attached proof-of-concept patch to test this idea.
Unfortunately, I got poor results:
---- pg_upgrade ----
dump restore dmp|res git dmp/res
1 0.12 0.07 0.13 11.16 13.03
1000 3.80 2.83 5.46 18.78 20.27
2000 5.39 5.65 13.99 26.78 28.54
4000 16.08 12.40 28.34 41.90 44.03
8000 32.77 25.70 57.97 78.61 80.09
16000 57.67 63.42 134.43 158.49 165.78
32000 131.84 176.27 302.85 380.11 389.48
64000 270.37 708.30 1004.39 1085.39 1094.70
The last two columns show the patch didn't help at all, and the third
column shows it is just executing the pg_dump, then the restore, not in
parallel, i.e. column 1 + column 2 ~= column 3.
...
I will now test using PRIMARY KEY and custom dump format with pg_restore
--jobs to see if I can get parallelism that way.
I have some new interesting results (in seconds, test script attached):
---- -Fc ---- ------- dump | pg_restore/psql ------ - pg_upgrade -
dump restore -Fc -Fc|-1 -Fc|-j -Fp -Fp|-1 git patch
1 0.14 0.08 0.14 0.16 0.19 0.13 0.15 11.04 13.07
1000 3.08 3.65 6.53 6.60 5.39 6.37 6.54 21.05 22.18
2000 6.06 6.52 12.15 11.78 10.52 12.89 12.11 31.93 31.65
4000 11.07 14.68 25.12 24.47 22.07 26.77 26.77 56.03 47.03
8000 20.85 32.03 53.68 45.23 45.10 59.20 51.33 104.99 85.19
16000 40.28 88.36 127.63 96.65 106.33 136.68 106.64 221.82 157.36
32000 93.78 274.99 368.54 211.30 294.76 376.36 229.80 544.73 321.19
64000 197.79 1109.22 1336.83 577.83 1117.55 1327.98 567.84 1766.12 763.02
I tested custom format with pg_restore -j and -1, as well as text
restore. The winner was pg_dump -Fc | pg_restore -1; even -j could not
beat it. (FYI, Andrew Dunstan told me that indexes can be restored in
parallel with -j.) That is actually helpful because we can use process
parallelism to restore multiple databases at the same time without
having to use processes for -j parallelism.
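The comparison above can be spot-checked with a little arithmetic; the numbers are copied from the table (git vs. patch pg_upgrade times at 64000 tables), so this is a sanity check, not a new measurement:

```shell
# Sanity-check arithmetic on the timings reported above (seconds copied
# from the pg_upgrade columns of the table; no new measurements here).
awk 'BEGIN {
    git = 1766.12; patch = 763.02        # 64000-table pg_upgrade times
    printf "64000 tables: git %.2fs, patch %.2fs, %.2fx speedup\n",
           git, patch, git / patch
}'
```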
Attached is my pg_upgrade patch for this. I am going to polish it up
for 9.3 application.
A further improvement would be to allow multiple databases to be
dumped/restored at the same time. I will test for that once this is done.
I will work on this next.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Attachments:
pg_upgrade.diff (text/x-diff; charset=us-ascii)
diff --git a/contrib/pg_upgrade/dump.c b/contrib/pg_upgrade/dump.c
new file mode 100644
index 577ccac..dbdf9a5
*** a/contrib/pg_upgrade/dump.c
--- b/contrib/pg_upgrade/dump.c
*************** generate_old_dump(void)
*** 24,30 ****
* restores the frozenid's for databases and relations.
*/
exec_prog(UTILITY_LOG_FILE, NULL, true,
! "\"%s/pg_dumpall\" %s --schema-only --binary-upgrade %s -f %s",
new_cluster.bindir, cluster_conn_opts(&old_cluster),
log_opts.verbose ? "--verbose" : "",
ALL_DUMP_FILE);
--- 24,30 ----
* restores the frozenid's for databases and relations.
*/
exec_prog(UTILITY_LOG_FILE, NULL, true,
! "\"%s/pg_dumpall\" %s --schema-only --globals-only --binary-upgrade %s -f %s",
new_cluster.bindir, cluster_conn_opts(&old_cluster),
log_opts.verbose ? "--verbose" : "",
ALL_DUMP_FILE);
*************** generate_old_dump(void)
*** 47,63 ****
void
split_old_dump(void)
{
! FILE *all_dump,
! *globals_dump,
! *db_dump;
! FILE *current_output;
char line[LINE_ALLOC];
bool start_of_line = true;
char create_role_str[MAX_STRING];
char create_role_str_quote[MAX_STRING];
char filename[MAXPGPATH];
- bool suppressed_username = false;
-
/*
* Open all files in binary mode to avoid line end translation on Windows,
--- 47,58 ----
void
split_old_dump(void)
{
! FILE *all_dump, *globals_dump;
char line[LINE_ALLOC];
bool start_of_line = true;
char create_role_str[MAX_STRING];
char create_role_str_quote[MAX_STRING];
char filename[MAXPGPATH];
/*
* Open all files in binary mode to avoid line end translation on Windows,
*************** split_old_dump(void)
*** 70,80 ****
snprintf(filename, sizeof(filename), "%s", GLOBALS_DUMP_FILE);
if ((globals_dump = fopen_priv(filename, PG_BINARY_W)) == NULL)
pg_log(PG_FATAL, "Could not write to dump file \"%s\": %s\n", filename, getErrorText(errno));
- snprintf(filename, sizeof(filename), "%s", DB_DUMP_FILE);
- if ((db_dump = fopen_priv(filename, PG_BINARY_W)) == NULL)
- pg_log(PG_FATAL, "Could not write to dump file \"%s\": %s\n", filename, getErrorText(errno));
-
- current_output = globals_dump;
/* patterns used to prevent our own username from being recreated */
snprintf(create_role_str, sizeof(create_role_str),
--- 65,70 ----
*************** split_old_dump(void)
*** 84,102 ****
while (fgets(line, sizeof(line), all_dump) != NULL)
{
- /* switch to db_dump file output? */
- if (current_output == globals_dump && start_of_line &&
- suppressed_username &&
- strncmp(line, "\\connect ", strlen("\\connect ")) == 0)
- current_output = db_dump;
-
/* output unless we are recreating our own username */
! if (current_output != globals_dump || !start_of_line ||
(strncmp(line, create_role_str, strlen(create_role_str)) != 0 &&
strncmp(line, create_role_str_quote, strlen(create_role_str_quote)) != 0))
! fputs(line, current_output);
! else
! suppressed_username = true;
if (strlen(line) > 0 && line[strlen(line) - 1] == '\n')
start_of_line = true;
--- 74,84 ----
while (fgets(line, sizeof(line), all_dump) != NULL)
{
/* output unless we are recreating our own username */
! if (!start_of_line ||
(strncmp(line, create_role_str, strlen(create_role_str)) != 0 &&
strncmp(line, create_role_str_quote, strlen(create_role_str_quote)) != 0))
! fputs(line, globals_dump);
if (strlen(line) > 0 && line[strlen(line) - 1] == '\n')
start_of_line = true;
*************** split_old_dump(void)
*** 106,110 ****
fclose(all_dump);
fclose(globals_dump);
- fclose(db_dump);
}
--- 88,91 ----
diff --git a/contrib/pg_upgrade/function.c b/contrib/pg_upgrade/function.c
new file mode 100644
index 77bd3a0..d95dd6f
*** a/contrib/pg_upgrade/function.c
--- b/contrib/pg_upgrade/function.c
*************** uninstall_support_functions_from_new_clu
*** 102,111 ****
prep_status("Removing support functions from new cluster");
! for (dbnum = 0; dbnum < new_cluster.dbarr.ndbs; dbnum++)
{
! DbInfo *new_db = &new_cluster.dbarr.dbs[dbnum];
! PGconn *conn = connectToServer(&new_cluster, new_db->db_name);
/* suppress NOTICE of dropped objects */
PQclear(executeQueryOrDie(conn,
--- 102,112 ----
prep_status("Removing support functions from new cluster");
! /* use old db names because there might be a mismatch */
! for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
{
! DbInfo *old_db = &old_cluster.dbarr.dbs[dbnum];
! PGconn *conn = connectToServer(&new_cluster, old_db->db_name);
/* suppress NOTICE of dropped objects */
PQclear(executeQueryOrDie(conn,
diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
new file mode 100644
index 4d2e79c..1e72108
*** a/contrib/pg_upgrade/pg_upgrade.c
--- b/contrib/pg_upgrade/pg_upgrade.c
*************** main(int argc, char **argv)
*** 117,129 ****
/* New now using xids of the old system */
/* -- NEW -- */
start_postmaster(&new_cluster);
prepare_new_databases();
!
create_new_objects();
stop_postmaster(false);
/*
* Most failures happen in create_new_objects(), which has completed at
--- 117,134 ----
/* New now using xids of the old system */
/* -- NEW -- */
+ old_cluster.port++;
+ start_postmaster(&old_cluster);
start_postmaster(&new_cluster);
prepare_new_databases();
!
create_new_objects();
stop_postmaster(false);
+ os_info.running_cluster = &old_cluster;
+ stop_postmaster(false);
+ old_cluster.port--;
/*
* Most failures happen in create_new_objects(), which has completed at
*************** static void
*** 279,308 ****
create_new_objects(void)
{
int dbnum;
prep_status("Adding support functions to new cluster");
! for (dbnum = 0; dbnum < new_cluster.dbarr.ndbs; dbnum++)
{
! DbInfo *new_db = &new_cluster.dbarr.dbs[dbnum];
/* skip db we already installed */
! if (strcmp(new_db->db_name, "template1") != 0)
! install_support_functions_in_new_db(new_db->db_name);
}
check_ok();
prep_status("Restoring database schema to new cluster");
! exec_prog(RESTORE_LOG_FILE, NULL, true,
! "\"%s/psql\" " EXEC_PSQL_ARGS " %s -f \"%s\"",
! new_cluster.bindir, cluster_conn_opts(&new_cluster),
! DB_DUMP_FILE);
check_ok();
/* regenerate now that we have objects in the databases */
get_db_and_rel_infos(&new_cluster);
uninstall_support_functions_from_new_cluster();
}
/*
--- 284,335 ----
create_new_objects(void)
{
int dbnum;
+ /* save off conn_opts because it is a static local var */
+ char *old_conn_opts = pg_strdup(cluster_conn_opts(&old_cluster));
prep_status("Adding support functions to new cluster");
! /*
! * The new cluster might have databases that don't exist in the old
! * one, so cycle over the old database names.
! */
! for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
{
! DbInfo *old_db = &old_cluster.dbarr.dbs[dbnum];
/* skip db we already installed */
! if (strcmp(old_db->db_name, "template1") != 0)
! install_support_functions_in_new_db(old_db->db_name);
}
check_ok();
prep_status("Restoring database schema to new cluster");
!
! for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
! {
! DbInfo *old_db = &old_cluster.dbarr.dbs[dbnum];
!
! /*
! * Using pg_restore --single-transaction is faster than other
! * methods, like --jobs. pg_dump only produces its output at the
! * end, so there is little parallelism using the pipe.
! */
! exec_prog(RESTORE_LOG_FILE, NULL, true,
! "\"%s/pg_dump\" %s --schema-only --binary-upgrade --format=custom %s \"%s\" |"
! "\"%s/pg_restore\" %s --exit-on-error --single-transaction %s --dbname \"%s\"",
! new_cluster.bindir, old_conn_opts,
! log_opts.verbose ? "--verbose" : "", old_db->db_name,
! new_cluster.bindir, cluster_conn_opts(&new_cluster),
! log_opts.verbose ? "--verbose" : "", old_db->db_name);
! }
check_ok();
/* regenerate now that we have objects in the databases */
get_db_and_rel_infos(&new_cluster);
uninstall_support_functions_from_new_cluster();
+
+ pg_free(old_conn_opts);
}
/*
*************** cleanup(void)
*** 463,468 ****
/* remove SQL files */
unlink(ALL_DUMP_FILE);
unlink(GLOBALS_DUMP_FILE);
- unlink(DB_DUMP_FILE);
}
}
--- 490,494 ----
diff --git a/contrib/pg_upgrade/pg_upgrade.h b/contrib/pg_upgrade/pg_upgrade.h
new file mode 100644
index ace56e5..72ed3bd
*** a/contrib/pg_upgrade/pg_upgrade.h
--- b/contrib/pg_upgrade/pg_upgrade.h
***************
*** 32,38 ****
#define ALL_DUMP_FILE "pg_upgrade_dump_all.sql"
/* contains both global db information and CREATE DATABASE commands */
#define GLOBALS_DUMP_FILE "pg_upgrade_dump_globals.sql"
- #define DB_DUMP_FILE "pg_upgrade_dump_db.sql"
#define SERVER_LOG_FILE "pg_upgrade_server.log"
#define RESTORE_LOG_FILE "pg_upgrade_restore.log"
--- 32,37 ----
diff --git a/src/bin/pg_dump/pg_dumpall.c b/src/bin/pg_dump/pg_dumpall.c
new file mode 100644
index ca95bad..ce27fe3
*** a/src/bin/pg_dump/pg_dumpall.c
--- b/src/bin/pg_dump/pg_dumpall.c
*************** main(int argc, char *argv[])
*** 502,508 ****
}
/* Dump CREATE DATABASE commands */
! if (!globals_only && !roles_only && !tablespaces_only)
dumpCreateDB(conn);
/* Dump role/database settings */
--- 502,508 ----
}
/* Dump CREATE DATABASE commands */
! if (binary_upgrade || (!globals_only && !roles_only && !tablespaces_only))
dumpCreateDB(conn);
/* Dump role/database settings */
On Tue, Nov 27, 2012 at 8:13 PM, Bruce Momjian <bruce@momjian.us> wrote:
I have some new interesting results (in seconds, test script attached):
---- -Fc ---- ------- dump | pg_restore/psql ------ - pg_upgrade -
dump restore -Fc -Fc|-1 -Fc|-j -Fp -Fp|-1 git patch
1 0.14 0.08 0.14 0.16 0.19 0.13 0.15 11.04 13.07
1000 3.08 3.65 6.53 6.60 5.39 6.37 6.54 21.05 22.18
2000 6.06 6.52 12.15 11.78 10.52 12.89 12.11 31.93 31.65
4000 11.07 14.68 25.12 24.47 22.07 26.77 26.77 56.03 47.03
8000 20.85 32.03 53.68 45.23 45.10 59.20 51.33 104.99 85.19
16000 40.28 88.36 127.63 96.65 106.33 136.68 106.64 221.82 157.36
32000 93.78 274.99 368.54 211.30 294.76 376.36 229.80 544.73 321.19
64000 197.79 1109.22 1336.83 577.83 1117.55 1327.98 567.84 1766.12 763.02
I tested custom format with pg_restore -j and -1, as well as text
restore. The winner was pg_dump -Fc | pg_restore -1;
I don't have the numbers at hand, but if my relcache patch is
accepted, then "-1" stops being faster.
-1 gets rid of the AtOEXAct relcache N^2 behavior, but at the cost of
invoking a different N^2, that one in the stats system.
Cheers,
Jeff
On Tue, Nov 27, 2012 at 09:35:10PM -0800, Jeff Janes wrote:
On Tue, Nov 27, 2012 at 8:13 PM, Bruce Momjian <bruce@momjian.us> wrote:
I have some new interesting results (in seconds, test script attached):
---- -Fc ---- ------- dump | pg_restore/psql ------ - pg_upgrade -
dump restore -Fc -Fc|-1 -Fc|-j -Fp -Fp|-1 git patch
1 0.14 0.08 0.14 0.16 0.19 0.13 0.15 11.04 13.07
1000 3.08 3.65 6.53 6.60 5.39 6.37 6.54 21.05 22.18
2000 6.06 6.52 12.15 11.78 10.52 12.89 12.11 31.93 31.65
4000 11.07 14.68 25.12 24.47 22.07 26.77 26.77 56.03 47.03
8000 20.85 32.03 53.68 45.23 45.10 59.20 51.33 104.99 85.19
16000 40.28 88.36 127.63 96.65 106.33 136.68 106.64 221.82 157.36
32000 93.78 274.99 368.54 211.30 294.76 376.36 229.80 544.73 321.19
64000 197.79 1109.22 1336.83 577.83 1117.55 1327.98 567.84 1766.12 763.02
I tested custom format with pg_restore -j and -1, as well as text
restore. The winner was pg_dump -Fc | pg_restore -1;
I don't have the numbers at hand, but if my relcache patch is
accepted, then "-1" stops being faster.
-1 gets rid of the AtOEXAct relcache N^2 behavior, but at the cost of
invoking a different N^2, that one in the stats system.
I was going to ask you that. :-) Let me run a test with your patch
now.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Tue, Nov 27, 2012 at 09:35:10PM -0800, Jeff Janes wrote:
I tested custom format with pg_restore -j and -1, as well as text
restore. The winner was pg_dump -Fc | pg_restore -1;I don't have the numbers at hand, but if my relcache patch is
accepted, then "-1" stops being faster.-1 gets rid of the AtOEXAct relcache N^2 behavior, but at the cost of
invoking a different N^2, that one in the stats system.
OK, here are the testing results:
#tbls git -1 AtOEXAct both
1 11.06 13.06 10.99 13.20
1000 21.71 22.92 22.20 22.51
2000 32.86 31.09 32.51 31.62
4000 55.22 49.96 52.50 49.99
8000 105.34 82.10 95.32 82.94
16000 223.67 164.27 187.40 159.53
32000 543.93 324.63 366.44 317.93
64000 1697.14 791.82 767.32 752.57
Up to 2k, they are all similar. 4k & 8k have the -1 patch as a win, and
16k+ really need both patches.
I will continue working on the -1 patch, and hopefully we can get your
AtOEXAct patch in soon. Is someone reviewing that?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Wed, Nov 28, 2012 at 03:22:32PM -0500, Bruce Momjian wrote:
On Tue, Nov 27, 2012 at 09:35:10PM -0800, Jeff Janes wrote:
I tested custom format with pg_restore -j and -1, as well as text
restore. The winner was pg_dump -Fc | pg_restore -1;
I don't have the numbers at hand, but if my relcache patch is
accepted, then "-1" stops being faster.
-1 gets rid of the AtOEXAct relcache N^2 behavior, but at the cost of
invoking a different N^2, that one in the stats system.
OK, here are the testing results:
#tbls git -1 AtOEXAct both
1 11.06 13.06 10.99 13.20
1000 21.71 22.92 22.20 22.51
2000 32.86 31.09 32.51 31.62
4000 55.22 49.96 52.50 49.99
8000 105.34 82.10 95.32 82.94
16000 223.67 164.27 187.40 159.53
32000 543.93 324.63 366.44 317.93
64000 1697.14 791.82 767.32 752.57
Up to 2k, they are all similar. 4k & 8k have the -1 patch as a win, and
16k+ really need both patches.
I will continue working on the -1 patch, and hopefully we can get your
AtOEXAct patch in soon. Is someone reviewing that?
I have polished up the patch (attached) and it is ready for application
to 9.3.
Since there is no pg_dump/pg_restore pipe parallelism, I had the old
cluster create per-database dump files, so I don't need to have the old
and new clusters running at the same time, which would have required two
port numbers and made shared memory exhaustion more likely.
We now create a dump file per database, so thousands of database dump
files might cause a performance problem.
This also adds status output so you can see the database names as their
schemas are dumped and restored. This was requested by users.
I retained custom mode for pg_dump because it is measurably faster than
text mode (not sure why, psql overhead?):
git -Fc -Fp
1 11.04 11.08 11.02
1000 22.37 19.68 21.64
2000 32.39 28.62 31.40
4000 56.18 48.53 51.15
8000 105.15 81.23 91.84
16000 227.64 156.72 177.79
32000 542.80 323.19 371.81
64000 1711.77 789.17 865.03
Text dump files are slightly easier to debug, but probably not by much.
Single-transaction restores were recommended to me over a year ago (by
Magnus?), but I wanted to get pg_upgrade rock-solid before doing
optimization, and now is the right time to optimize.
One risk of single-transaction restores is max_locks_per_transaction
exhaustion, but you will need to increase that on the old cluster for
pg_dump anyway because that is done in a single transaction, so the only
new thing is that the new cluster might also need to adjust
max_locks_per_transaction.
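A sketch of the adjustment being described; the value is illustrative, must be sized to the number of relations, and takes effect only after a server restart:

```shell
# Hypothetical config fragment: both clusters may need a larger lock
# table, since the schema dump and the --single-transaction restore each
# hold a lock per relation within one transaction.  256 is illustrative.
echo "max_locks_per_transaction = 256" >> "$OLD_PGDATA/postgresql.conf"
echo "max_locks_per_transaction = 256" >> "$NEW_PGDATA/postgresql.conf"
```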
I was able to remove split_old_dump() because pg_dumpall now produces a
full global restore file and we do database dumps separately.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Attachments:
pg_upgrade.diff (text/x-diff; charset=us-ascii)
diff --git a/contrib/pg_upgrade/check.c b/contrib/pg_upgrade/check.c
new file mode 100644
index 285f10c..bccceb1
*** a/contrib/pg_upgrade/check.c
--- b/contrib/pg_upgrade/check.c
*************** output_check_banner(bool *live_check)
*** 72,78 ****
void
! check_old_cluster(bool live_check, char **sequence_script_file_name)
{
/* -- OLD -- */
--- 72,78 ----
void
! check_and_dump_old_cluster(bool live_check, char **sequence_script_file_name)
{
/* -- OLD -- */
*************** check_old_cluster(bool live_check, char
*** 131,140 ****
* the old server is running.
*/
if (!user_opts.check)
- {
generate_old_dump();
- split_old_dump();
- }
if (!live_check)
stop_postmaster(false);
--- 131,137 ----
diff --git a/contrib/pg_upgrade/dump.c b/contrib/pg_upgrade/dump.c
new file mode 100644
index 577ccac..d206e98
*** a/contrib/pg_upgrade/dump.c
--- b/contrib/pg_upgrade/dump.c
***************
*** 16,110 ****
void
generate_old_dump(void)
{
! /* run new pg_dumpall binary */
! prep_status("Creating catalog dump");
! /*
! * --binary-upgrade records the width of dropped columns in pg_class, and
! * restores the frozenid's for databases and relations.
! */
exec_prog(UTILITY_LOG_FILE, NULL, true,
! "\"%s/pg_dumpall\" %s --schema-only --binary-upgrade %s -f %s",
new_cluster.bindir, cluster_conn_opts(&old_cluster),
log_opts.verbose ? "--verbose" : "",
! ALL_DUMP_FILE);
! check_ok();
! }
!
!
! /*
! * split_old_dump
! *
! * This function splits pg_dumpall output into global values and
! * database creation, and per-db schemas. This allows us to create
! * the support functions between restoring these two parts of the
! * dump. We split on the first "\connect " after a CREATE ROLE
! * username match; this is where the per-db restore starts.
! *
! * We suppress recreation of our own username so we don't generate
! * an error during restore
! */
! void
! split_old_dump(void)
! {
! FILE *all_dump,
! *globals_dump,
! *db_dump;
! FILE *current_output;
! char line[LINE_ALLOC];
! bool start_of_line = true;
! char create_role_str[MAX_STRING];
! char create_role_str_quote[MAX_STRING];
! char filename[MAXPGPATH];
! bool suppressed_username = false;
!
!
! /*
! * Open all files in binary mode to avoid line end translation on Windows,
! * both for input and output.
! */
!
! snprintf(filename, sizeof(filename), "%s", ALL_DUMP_FILE);
! if ((all_dump = fopen(filename, PG_BINARY_R)) == NULL)
! pg_log(PG_FATAL, "Could not open dump file \"%s\": %s\n", filename, getErrorText(errno));
! snprintf(filename, sizeof(filename), "%s", GLOBALS_DUMP_FILE);
! if ((globals_dump = fopen_priv(filename, PG_BINARY_W)) == NULL)
! pg_log(PG_FATAL, "Could not write to dump file \"%s\": %s\n", filename, getErrorText(errno));
! snprintf(filename, sizeof(filename), "%s", DB_DUMP_FILE);
! if ((db_dump = fopen_priv(filename, PG_BINARY_W)) == NULL)
! pg_log(PG_FATAL, "Could not write to dump file \"%s\": %s\n", filename, getErrorText(errno));
!
! current_output = globals_dump;
!
! /* patterns used to prevent our own username from being recreated */
! snprintf(create_role_str, sizeof(create_role_str),
! "CREATE ROLE %s;", os_info.user);
! snprintf(create_role_str_quote, sizeof(create_role_str_quote),
! "CREATE ROLE %s;", quote_identifier(os_info.user));
! while (fgets(line, sizeof(line), all_dump) != NULL)
{
! /* switch to db_dump file output? */
! if (current_output == globals_dump && start_of_line &&
! suppressed_username &&
! strncmp(line, "\\connect ", strlen("\\connect ")) == 0)
! current_output = db_dump;
! /* output unless we are recreating our own username */
! if (current_output != globals_dump || !start_of_line ||
! (strncmp(line, create_role_str, strlen(create_role_str)) != 0 &&
! strncmp(line, create_role_str_quote, strlen(create_role_str_quote)) != 0))
! fputs(line, current_output);
! else
! suppressed_username = true;
! if (strlen(line) > 0 && line[strlen(line) - 1] == '\n')
! start_of_line = true;
! else
! start_of_line = false;
}
! fclose(all_dump);
! fclose(globals_dump);
! fclose(db_dump);
}
--- 16,49 ----
void
generate_old_dump(void)
{
! int dbnum;
! prep_status("Creating catalog dump\n");
!
! pg_log(PG_REPORT, OVERWRITE_MESSAGE, "global objects");
!
! /* run new pg_dumpall binary for globals */
exec_prog(UTILITY_LOG_FILE, NULL, true,
! "\"%s/pg_dumpall\" %s --schema-only --globals-only --binary-upgrade %s -f %s",
new_cluster.bindir, cluster_conn_opts(&old_cluster),
log_opts.verbose ? "--verbose" : "",
! GLOBALS_DUMP_FILE);
! /* create per-db dump files */
! for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
{
! char file_name[MAXPGPATH];
! DbInfo *old_db = &old_cluster.dbarr.dbs[dbnum];
! pg_log(PG_REPORT, OVERWRITE_MESSAGE, old_db->db_name);
! snprintf(file_name, sizeof(file_name), DB_DUMP_FILE_MASK, old_db->db_oid);
! exec_prog(RESTORE_LOG_FILE, NULL, true,
! "\"%s/pg_dump\" %s --schema-only --binary-upgrade --format=custom %s --file=\"%s\" \"%s\"",
! new_cluster.bindir, cluster_conn_opts(&old_cluster),
! log_opts.verbose ? "--verbose" : "", file_name, old_db->db_name);
}
! end_progress_output();
! check_ok();
}
diff --git a/contrib/pg_upgrade/exec.c b/contrib/pg_upgrade/exec.c
new file mode 100644
index 76247fd..35de541
*** a/contrib/pg_upgrade/exec.c
--- b/contrib/pg_upgrade/exec.c
*************** exec_prog(const char *log_file, const ch
*** 104,111 ****
if (result != 0)
{
! report_status(PG_REPORT, "*failure*");
fflush(stdout);
pg_log(PG_VERBOSE, "There were problems executing \"%s\"\n", cmd);
if (opt_log_file)
pg_log(throw_error ? PG_FATAL : PG_REPORT,
--- 104,113 ----
if (result != 0)
{
! /* we might be in on a progress status line, so go to the next line */
! report_status(PG_REPORT, "\n*failure*");
fflush(stdout);
+
pg_log(PG_VERBOSE, "There were problems executing \"%s\"\n", cmd);
if (opt_log_file)
pg_log(throw_error ? PG_FATAL : PG_REPORT,
diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
new file mode 100644
index 4d2e79c..bdc6d13
*** a/contrib/pg_upgrade/pg_upgrade.c
--- b/contrib/pg_upgrade/pg_upgrade.c
*************** main(int argc, char **argv)
*** 92,98 ****
check_cluster_compatibility(live_check);
! check_old_cluster(live_check, &sequence_script_file_name);
/* -- NEW -- */
--- 92,98 ----
check_cluster_compatibility(live_check);
! check_and_dump_old_cluster(live_check, &sequence_script_file_name);
/* -- NEW -- */
*************** create_new_objects(void)
*** 282,287 ****
--- 282,292 ----
prep_status("Adding support functions to new cluster");
+ /*
+ * Technically, we only need to install these support functions in new
+ * databases that also exist in the old cluster, but for completeness
+ * we process all new databases.
+ */
for (dbnum = 0; dbnum < new_cluster.dbarr.ndbs; dbnum++)
{
DbInfo *new_db = &new_cluster.dbarr.dbs[dbnum];
*************** create_new_objects(void)
*** 292,302 ****
}
check_ok();
! prep_status("Restoring database schema to new cluster");
! exec_prog(RESTORE_LOG_FILE, NULL, true,
! "\"%s/psql\" " EXEC_PSQL_ARGS " %s -f \"%s\"",
! new_cluster.bindir, cluster_conn_opts(&new_cluster),
! DB_DUMP_FILE);
check_ok();
/* regenerate now that we have objects in the databases */
--- 297,323 ----
}
check_ok();
! prep_status("Restoring database schema to new cluster\n");
!
! for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
! {
! char file_name[MAXPGPATH];
! DbInfo *old_db = &old_cluster.dbarr.dbs[dbnum];
!
! pg_log(PG_REPORT, OVERWRITE_MESSAGE, old_db->db_name);
! snprintf(file_name, sizeof(file_name), DB_DUMP_FILE_MASK, old_db->db_oid);
!
! /*
! * Using pg_restore --single-transaction is faster than other
! * methods, like --jobs. pg_dump only produces its output at the
! * end, so there is little parallelism using the pipe.
! */
! exec_prog(RESTORE_LOG_FILE, NULL, true,
! "\"%s/pg_restore\" %s --exit-on-error --single-transaction --verbose --dbname \"%s\" \"%s\"",
! new_cluster.bindir, cluster_conn_opts(&new_cluster),
! old_db->db_name, file_name);
! }
! end_progress_output();
check_ok();
/* regenerate now that we have objects in the databases */
*************** cleanup(void)
*** 455,468 ****
/* Remove dump and log files? */
if (!log_opts.retain)
{
char **filename;
for (filename = output_files; *filename != NULL; filename++)
unlink(*filename);
! /* remove SQL files */
! unlink(ALL_DUMP_FILE);
unlink(GLOBALS_DUMP_FILE);
! unlink(DB_DUMP_FILE);
}
}
--- 476,498 ----
/* Remove dump and log files? */
if (!log_opts.retain)
{
+ int dbnum;
char **filename;
for (filename = output_files; *filename != NULL; filename++)
unlink(*filename);
! /* remove dump files */
unlink(GLOBALS_DUMP_FILE);
!
! if (old_cluster.dbarr.dbs)
! for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
! {
! char file_name[MAXPGPATH];
! DbInfo *old_db = &old_cluster.dbarr.dbs[dbnum];
!
! snprintf(file_name, sizeof(file_name), DB_DUMP_FILE_MASK, old_db->db_oid);
! unlink(file_name);
! }
}
}
diff --git a/contrib/pg_upgrade/pg_upgrade.h b/contrib/pg_upgrade/pg_upgrade.h
new file mode 100644
index ace56e5..d981035
*** a/contrib/pg_upgrade/pg_upgrade.h
--- b/contrib/pg_upgrade/pg_upgrade.h
***************
*** 29,38 ****
#define OVERWRITE_MESSAGE " %-" MESSAGE_WIDTH "." MESSAGE_WIDTH "s\r"
#define GET_MAJOR_VERSION(v) ((v) / 100)
- #define ALL_DUMP_FILE "pg_upgrade_dump_all.sql"
/* contains both global db information and CREATE DATABASE commands */
#define GLOBALS_DUMP_FILE "pg_upgrade_dump_globals.sql"
! #define DB_DUMP_FILE "pg_upgrade_dump_db.sql"
#define SERVER_LOG_FILE "pg_upgrade_server.log"
#define RESTORE_LOG_FILE "pg_upgrade_restore.log"
--- 29,37 ----
#define OVERWRITE_MESSAGE " %-" MESSAGE_WIDTH "." MESSAGE_WIDTH "s\r"
#define GET_MAJOR_VERSION(v) ((v) / 100)
/* contains both global db information and CREATE DATABASE commands */
#define GLOBALS_DUMP_FILE "pg_upgrade_dump_globals.sql"
! #define DB_DUMP_FILE_MASK "pg_upgrade_dump_%u.custom"
#define SERVER_LOG_FILE "pg_upgrade_server.log"
#define RESTORE_LOG_FILE "pg_upgrade_restore.log"
*************** extern OSInfo os_info;
*** 296,307 ****
/* check.c */
void output_check_banner(bool *live_check);
! void check_old_cluster(bool live_check,
char **sequence_script_file_name);
void check_new_cluster(void);
void report_clusters_compatible(void);
void issue_warnings(char *sequence_script_file_name);
! void output_completion_banner(char *analyze_script_file_name,
char *deletion_script_file_name);
void check_cluster_versions(void);
void check_cluster_compatibility(bool live_check);
--- 295,306 ----
/* check.c */
void output_check_banner(bool *live_check);
! void check_and_dump_old_cluster(bool live_check,
char **sequence_script_file_name);
void check_new_cluster(void);
void report_clusters_compatible(void);
void issue_warnings(char *sequence_script_file_name);
! void output_completion_banner(char *analyze_script_file_name,
char *deletion_script_file_name);
void check_cluster_versions(void);
void check_cluster_compatibility(bool live_check);
*************** void disable_old_cluster(void);
*** 319,325 ****
/* dump.c */
void generate_old_dump(void);
- void split_old_dump(void);
/* exec.c */
--- 318,323 ----
*************** __attribute__((format(PG_PRINTF_ATTRIBUT
*** 433,438 ****
--- 431,437 ----
void
pg_log(eLogType type, char *fmt,...)
__attribute__((format(PG_PRINTF_ATTRIBUTE, 2, 3)));
+ void end_progress_output(void);
void
prep_status(const char *fmt,...)
__attribute__((format(PG_PRINTF_ATTRIBUTE, 1, 2)));
diff --git a/contrib/pg_upgrade/relfilenode.c b/contrib/pg_upgrade/relfilenode.c
new file mode 100644
index 7dbaac9..14e66df
*** a/contrib/pg_upgrade/relfilenode.c
--- b/contrib/pg_upgrade/relfilenode.c
*************** transfer_all_new_dbs(DbInfoArr *old_db_a
*** 82,90 ****
}
}
! prep_status(" "); /* in case nothing printed; pass a space so
! * gcc doesn't complain about empty format
! * string */
check_ok();
return msg;
--- 82,88 ----
}
}
! end_progress_output();
check_ok();
return msg;
diff --git a/contrib/pg_upgrade/util.c b/contrib/pg_upgrade/util.c
new file mode 100644
index 1d4bc89..0c1eccc
*** a/contrib/pg_upgrade/util.c
--- b/contrib/pg_upgrade/util.c
*************** report_status(eLogType type, const char
*** 35,40 ****
--- 35,52 ----
}
+ /* force blank output for progress display */
+ void
+ end_progress_output(void)
+ {
+ /*
+ * In case nothing printed; pass a space so gcc doesn't complain about
+ * empty format string.
+ */
+ prep_status(" ");
+ }
+
+
/*
* prep_status
*
diff --git a/src/bin/pg_dump/pg_dumpall.c b/src/bin/pg_dump/pg_dumpall.c
new file mode 100644
index ca95bad..83eae81
*** a/src/bin/pg_dump/pg_dumpall.c
--- b/src/bin/pg_dump/pg_dumpall.c
*************** main(int argc, char *argv[])
*** 502,508 ****
}
/* Dump CREATE DATABASE commands */
! if (!globals_only && !roles_only && !tablespaces_only)
dumpCreateDB(conn);
/* Dump role/database settings */
--- 502,508 ----
}
/* Dump CREATE DATABASE commands */
! if (binary_upgrade || (!globals_only && !roles_only && !tablespaces_only))
dumpCreateDB(conn);
/* Dump role/database settings */
*************** dumpRoles(PGconn *conn)
*** 745,753 ****
* will acquire the right properties even if it already exists (ie, it
* won't hurt for the CREATE to fail). This is particularly important
* for the role we are connected as, since even with --clean we will
! * have failed to drop it.
*/
! appendPQExpBuffer(buf, "CREATE ROLE %s;\n", fmtId(rolename));
appendPQExpBuffer(buf, "ALTER ROLE %s WITH", fmtId(rolename));
if (strcmp(PQgetvalue(res, i, i_rolsuper), "t") == 0)
--- 745,755 ----
* will acquire the right properties even if it already exists (ie, it
* won't hurt for the CREATE to fail). This is particularly important
* for the role we are connected as, since even with --clean we will
! * have failed to drop it. binary_upgrade cannot generate any errors,
! * so we assume the role is already created.
*/
! if (!binary_upgrade)
! appendPQExpBuffer(buf, "CREATE ROLE %s;\n", fmtId(rolename));
appendPQExpBuffer(buf, "ALTER ROLE %s WITH", fmtId(rolename));
if (strcmp(PQgetvalue(res, i, i_rolsuper), "t") == 0)
On Thu, Nov 29, 2012 at 12:59:19PM -0500, Bruce Momjian wrote:
I have polished up the patch (attached) and it is ready for application
to 9.3.
Applied.
---------------------------------------------------------------------------
Since there is no pg_dump/pg_restore pipe parallelism, I had the old
cluster create per-database dump files, so I don't need to have the old
and new clusters running at the same time, which would have required two
port numbers and made shared memory exhaustion more likely.

We now create a dump file per database, so thousands of database dump
files might cause a performance problem.

This also adds status output so you can see the database names as their
schemas are dumped and restored. This was requested by users.

I retained custom mode for pg_dump because it is measurably faster than
text mode (not sure why, psql overhead?):

	tables      git     -Fc     -Fp
	     1    11.04   11.08   11.02
	  1000    22.37   19.68   21.64
	  2000    32.39   28.62   31.40
	  4000    56.18   48.53   51.15
	  8000   105.15   81.23   91.84
	 16000   227.64  156.72  177.79
	 32000   542.80  323.19  371.81
	 64000  1711.77  789.17  865.03

Text dump files are slightly easier to debug, but probably not by much.

Single-transaction restores were recommended to me over a year ago (by
Magnus?), but I wanted to get pg_upgrade rock-solid before doing
optimization, and now is the right time to optimize.

One risk of single-transaction restores is max_locks_per_transaction
exhaustion, but you will need to increase that on the old cluster for
pg_dump anyway because that dump is done in a single transaction, so the
only new thing is that the new cluster might also need a higher
max_locks_per_transaction.

I was able to remove split_old_dump() because pg_dumpall now produces a
full global restore file and we do database dumps separately.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 23 November 2012 22:34, Jeff Janes <jeff.janes@gmail.com> wrote:
I got rid of need_eoxact_work entirely and replaced it with a short
list that fulfills the functions of indicating that work is needed,
and suggesting which rels might need that work. There is no attempt
to prevent duplicates, nor to remove invalidated entries from the
list. Invalid entries are skipped when the hash entry is not found,
and processing is idempotent so duplicates are not a problem.

Formally speaking, if MAX_EOXACT_LIST were 0, so that the list
overflowed the first time it was accessed, then it would be identical
to the current behavior of having only a flag. So formally all I did
was increase the max from 0 to 10.
...
It is not obvious what value to set the MAX list size to.
A few questions, that may help you...
Why did you pick 10, when your create temp table example needs 110?
Why does the list not grow as needed?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 9 November 2012 18:50, Jeff Janes <jeff.janes@gmail.com> wrote:
quadratic behavior in the resource owner/lock table
I didn't want to let that particular phrase go by without saying
"exactly what behaviour is that?", so we can discuss fixing that also.
This maybe something I already know about, but its worth asking about.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jan 9, 2013 at 3:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 9 November 2012 18:50, Jeff Janes <jeff.janes@gmail.com> wrote:
quadratic behavior in the resource owner/lock table
I didn't want to let that particular phrase go by without saying
"exactly what behaviour is that?", so we can discuss fixing that also.
It is the thing that was fixed in commit eeb6f37d89fc60c6449ca1, "Add
a small cache of locks owned by a resource owner in ResourceOwner."
But that fix is only in 9.3devel.
Cheers,
Jeff
On 9 January 2013 17:50, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Jan 9, 2013 at 3:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 9 November 2012 18:50, Jeff Janes <jeff.janes@gmail.com> wrote:
quadratic behavior in the resource owner/lock table
I didn't want to let that particular phrase go by without saying
"exactly what behaviour is that?", so we can discuss fixing that also.It is the thing that was fixed in commit eeb6f37d89fc60c6449ca1, "Add
a small cache of locks owned by a resource owner in ResourceOwner."
But that fix is only in 9.3devel.
That's good, it fixes the problem I reported in 2010, under
"SAVEPOINTs and COMMIT performance".
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wednesday, January 9, 2013, Simon Riggs wrote:
On 23 November 2012 22:34, Jeff Janes <jeff.janes@gmail.com> wrote:

I got rid of need_eoxact_work entirely and replaced it with a short
list that fulfills the functions of indicating that work is needed,
and suggesting which rels might need that work. There is no attempt
to prevent duplicates, nor to remove invalidated entries from the
list. Invalid entries are skipped when the hash entry is not found,
and processing is idempotent so duplicates are not a problem.Formally speaking, if MAX_EOXACT_LIST were 0, so that the list
overflowed the first time it was accessed, then it would be identical
to the current behavior or having only a flag. So formally all I did
was increase the max from 0 to 10....
It is not obvious what value to set the MAX list size to.
A few questions, that may help you...
Why did you pick 10, when your create temp table example needs 110?
The 110 is the size of the RelationIdCache at the end of my augmented
pgbench transaction. But, only one of those entries needs any work, so for
that example a MAX of 1 would suffice. But 1 seems to be cutting it rather
close, so I picked the next largest power of 10.
The downsides of making the MAX larger are:
1) For ordinary work loads, each backend needs a very little bit more
memory for the static array. (this would change if we want to extend this
to EOsubXACT as well as EOXACT, because there can be only 1 XACT but an
unlimited number of SubXACT)
2) For pathological work loads that add the same relation to the list over
and over again thousands of times, they have to
grovel through that list at EOX, which in theory could be more work than
going through the entire non-redundant RelationIdCache hash. (I have no
idea what a pathological work load might actually look like in practice,
but it seems like a good idea to assume that one might exist)
We could prevent duplicates from being added to the list in the first
place, but the overhead need to do that seems like a sure loser for
ordinary work loads.
By making the list over-flowable, we fix a demonstrated pathological
workload (restore of huge schemas); we impose no detectable penalty to
normal workloads; and we fail to improve, but also fail to make worse, a
hypothetical pathological workload. All at the expense of a few bytes per
backend.
If the list overflowed at 100 rather than 10, the only cost would probably
be the extra bytes used per process. (Because the minimum realistic size
of RelationIdCache is 100, and I assume iterating over 100 hash tags which
may or may not exist and/or be redundant is about the same amount of work
as iterating over a hash which has at least 100 entries)
If we increase the overflow above 100, we might be making things worse for
some pathological workload whose existence is entirely hypothetical--but
the workload that benefits from setting it above 100 is also hypothetical.
So while 10 might be too small, above 100 doesn't seem to be defensible in
the absence of known cases.
Why does the list not grow as needed?
It would increase the code complexity for no concretely-known benefit.
If we are concerned about space, the extra bytes of compiled code needed to
implement dynamic growth would certainly exceed the bytes need to just jack
the MAX setting up to static setting 100 or 500 or so.
For dynamic growth to be a win, we would have to have a workload that
satisfies these conditions:
1) It would have to have some transactions that cause >10 or >100
relations to need cleanup.
2) It would have to have even more hundreds of relations
in RelationIdCache but which don't need cleanup (otherwise, if most
of RelationIdCache needs cleanup then iterating over that hash would be
just as efficient as iterating over a list which contains most of the said
hash)
3) The above described transaction would have to happen over and over
again, because if it only happens once there is no point in worrying about
a little inefficiency.
Cheers,
Jeff
* Jeff Janes (jeff.janes@gmail.com) wrote:
By making the list over-flowable, we fix a demonstrated pathological
workload (restore of huge schemas); we impose no detectable penalty to
normal workloads; and we fail to improve, but also fail to make worse, a
hypothetical pathological workload. All at the expense of a few bytes per
backend.
[...]
Why does the list not grow as needed?
It would increase the code complexity for no concretely-known benefit.
I'm curious if this is going to help with rollback's of transactions
which created lots of tables..? We've certainly seen that take much
longer than we'd like, although I've generally attributed it to doing
all of the unlink'ing and truncating of files.
I also wonder about making this a linked-list or something which can
trivially grow as we go and then walk later. That would also keep the
size of it small instead of a static/fixed amount.
1) It would have to have some transactions that cause >10 or >100 of
relations to need clean up.
That doesn't seem hard.
2) It would have to have even more hundreds of relations
in RelationIdCache but which don't need cleanup (otherwise, if most
of RelationIdCache needs cleanup then iterating over that hash would be
just as efficient as iterating over a list which contains most of the said
hash)
Good point.
3) The above described transaction would have to happen over and over
again, because if it only happens once there is no point in worrying about
a little inefficiency.
We regularly do builds where we have lots of created tables which are
later either committed or dropped (much of that is due to our
hand-crafted partitioning system..).
Looking through the pach itself, it looks pretty clean to me.
Thanks,
Stephen
Jeff Janes <jeff.janes@gmail.com> writes:
[ patch for AtEOXact_RelationCache ]
I've reviewed and committed this with some mostly-cosmetic adjustments,
notably:
* Applied it to AtEOSubXact cleanup too. AFAICS that's just as
idempotent, and it seemed weird to not use the same technique both
places.
* Dropped the hack to force a full-table scan in Assert mode. Although
that's a behavioral change that I suspect Jeff felt was above his pay
grade, it seemed to me that not exercising the now-normal hash_search
code path in assert-enabled testing was a bad idea. Also, the value of
exhaustive checking for relcache reference leaks is vastly lower than it
once was, because those refcounts are managed mostly automatically now.
* Redid the representation of the overflowed state a bit --- the way
that n_eoxact_list worked seemed a bit too cute/complicated for my
taste.
On Wednesday, January 9, 2013, Simon Riggs wrote:
Why does the list not grow as needed?
It would increase the code complexity for no concretely-known benefit.
Actually there's a better argument for that: at some point a long list
is actively counterproductive, because N hash_search lookups will cost
more than the full-table scan would.
I did some simple measurements that told me that with 100-odd entries
in the hashtable (which seems to be about the minimum for an active
backend), the hash_seq_search() traversal is about 40x more expensive
than one hash_search() lookup. (I find this number slightly
astonishing, but that's the answer I got.) So the crossover point
is at least 40 and probably quite a bit more, since (1) my measurement
did not count the cost of uselessly doing the actual relcache-entry
cleanup logic on non-targeted entries, and (2) if the list is that
long there are probably more than 100-odd entries in the hash table,
and hash table growth hurts the seqscan approach much more than the
search approach.
Now on the other side, simple single-command transactions are very
unlikely to have created more than a few list entries anyway. So
it's probably not worth getting very tense about the exact limit
as long as it's at least a couple dozen. I set the limit to 32
as committed, because that seemed like a nice round number in the
right general area.
BTW, this measurement also convinced me that the patch is a win
even when the hashtable is near minimum size, even though there's
no practical way to isolate the cost of AtEOXact_RelationCache in
vivo in such cases. It's good to know that we're not penalizing
simple cases to speed up the huge-number-of-relations case, even
if the penalty would be small.
regards, tom lane
Bruce Momjian <bruce@momjian.us> writes:
! * Using pg_restore --single-transaction is faster than other
! * methods, like --jobs.
Is this still the case now that Jeff's AtEOXact patch is in? The risk
of locktable overflow with --single-transaction makes me think that
pg_upgrade should avoid it unless there is a *really* strong performance
case for it, and I fear your old measurements are now invalidated.
regards, tom lane
Stephen Frost <sfrost@snowman.net> writes:
I'm curious if this is going to help with rollback's of transactions
which created lots of tables..? We've certainly seen that take much
longer than we'd like, although I've generally attributed it to doing
all of the unlink'ing and truncating of files.
If a single transaction creates lots of tables and then rolls back,
this patch won't change anything because we'll long since have
overflowed the eoxact list. But you weren't seeing an O(N^2) penalty
in such cases anyway: that penalty came from doing O(N) work in each
of N transactions. I'm sure you're right that you're mostly looking
at the filesystem cleanup work, which we can't do much about.
regards, tom lane
On Sunday, January 20, 2013, Stephen Frost wrote:
* Jeff Janes (jeff.janes@gmail.com) wrote:
By making the list over-flowable, we fix a demonstrated pathological
workload (restore of huge schemas); we impose no detectable penalty to
normal workloads; and we fail to improve, but also fail to make worse, a
hypothetical pathological workload. All at the expense of a few bytes
per backend.
[...]
Why does the list not grow as needed?
It would increase the code complexity for no concretely-known benefit.
I'm curious if this is going to help with rollback's of transactions
which created lots of tables..? We've certainly seen that take much
longer than we'd like, although I've generally attributed it to doing
all of the unlink'ing and truncating of files.
If you are using large shared_buffers, then you will probably get more
benefit from a different recent commit:
279628a Accelerate end-of-transaction dropping of relations.
Cheers,
Jeff
On Sun, Jan 20, 2013 at 02:11:48PM -0500, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
! * Using pg_restore --single-transaction is faster than other
! * methods, like --jobs.

Is this still the case now that Jeff's AtEOXact patch is in?  The risk
of locktable overflow with --single-transaction makes me think that
pg_upgrade should avoid it unless there is a *really* strong performance
case for it, and I fear your old measurements are now invalidated.
I had thought that the AtEOXact patch only helped single transactions
with many tables, but I now remember it mostly helps backends that have
accessed many tables.
With max_locks_per_transaction set high, I tested with the attached
patch that removes --single-transaction from pg_restore. I saw a 4%
improvement by removing that option, and 15% at 64k. (Test script
attached.) I have applied the patch. This is good news not just for
pg_upgrade but for other backends that access many tables.
	tables      git    patch
	     1    11.06    11.03
	  1000    19.97    20.86
	  2000    28.50    27.61
	  4000    46.90    45.65
	  8000    79.38    80.68
	 16000   153.33   147.13
	 32000   317.40   302.96
	 64000   782.94   659.52
FYI, this is better than the tests I did on the original patch that
showed --single-transaction was still a win then:
/messages/by-id/20121128202232.GA31741@momjian.us
	 #tbls      git      -1  AtEOXact     both
	     1    11.06   13.06     10.99    13.20
	  1000    21.71   22.92     22.20    22.51
	  2000    32.86   31.09     32.51    31.62
	  4000    55.22   49.96     52.50    49.99
	  8000   105.34   82.10     95.32    82.94
	 16000   223.67  164.27    187.40   159.53
	 32000   543.93  324.63    366.44   317.93
	 64000  1697.14  791.82    767.32   752.57
Keep in mind this doesn't totally avoid the requirement to increase
max_locks_per_transaction. There are cases at >6k where pg_dump runs
out of locks, but I don't see how we can improve that. Hopefully users
have already seen pg_dump fail and have adjusted
max_locks_per_transaction.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Attachments:
no-single.diff
diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
new file mode 100644
index 85997e5..88494b8
*** a/contrib/pg_upgrade/pg_upgrade.c
--- b/contrib/pg_upgrade/pg_upgrade.c
*************** create_new_objects(void)
*** 314,325 ****
snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
/*
! * Using pg_restore --single-transaction is faster than other
! * methods, like --jobs. pg_dump only produces its output at the
! * end, so there is little parallelism using the pipe.
*/
parallel_exec_prog(log_file_name, NULL,
! "\"%s/pg_restore\" %s --exit-on-error --single-transaction --verbose --dbname \"%s\" \"%s\"",
new_cluster.bindir, cluster_conn_opts(&new_cluster),
old_db->db_name, sql_file_name);
}
--- 314,324 ----
snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
/*
! * pg_dump only produces its output at the end, so there is little
! * parallelism if using the pipe.
*/
parallel_exec_prog(log_file_name, NULL,
! "\"%s/pg_restore\" %s --exit-on-error --verbose --dbname \"%s\" \"%s\"",
new_cluster.bindir, cluster_conn_opts(&new_cluster),
old_db->db_name, sql_file_name);
}